This project is a further analysis of Project 1 (the EDA project). The same data set is used, and deeper data analytics methods are practised.
The report is divided into three main parts: data preparation, classification and clustering.
In the first part, the original data will be cleaned and transformed to reduce noise and redundancy. Then a target classification variable will be decided, produced by merging with economic data from the World Bank. This part delivers a decent data set for the subsequent steps.
In the second part, classification will be applied. Feature selection will be based on three methods: 1) single variable (AUC and log-likelihood); 2) the Boruta algorithm; 3) a combination method (chi-square and correlation coefficient). Three combinations of variables will be selected. After that, three multi-variable classifiers will be trained on these combinations: 1) a decision tree classifier; 2) an XGBoost classifier; 3) logistic regression. Each model will be evaluated individually to find its best variable combination. Then each model will be explained (the decision tree by its plot, XGBoost and logistic regression by the LIME interpreter). In the end, the three models from the different classifiers, each with its best variable combination, will be evaluated again to obtain a final best model.
In the third part, clustering will be conducted. The first step is to process the data set, as clustering is an unsupervised method that does not involve a target variable. Then two methods, hierarchical clustering and k-means clustering, will be used to group the data into clusters. Both methods will attempt to find the best k value and will be interpreted.
In this project, some specific terms need to be clarified:
Classifier: the method used for classification, e.g. decision tree classifier, XGBoost classifier.
Combination/Variable Combination: a set of variables delivered by feature selection. For instance, Combination 1 indicates the combination "region" + "viewers" + "uploads".
Model: the pairing of a classifier and a combination. For instance, XGBoost with Variable Combination 1 constitutes a model.
Clusters: the groups into which the data points are divided by the clustering methods.
Import the packages that will be used in the project:
library(ggplot2)
library(dplyr)
library(gridExtra)
library(maps)
library(tidyverse)
library(ggthemes)
library(mice)
library(ROCR)
library(vtreat)
library(ROCit)
library(rpart)
library(rpart.plot)
library(pander)
library(class)
library(xgboost)
library(Boruta)
library(caret)
library(PerformanceAnalytics)
library(lime)
library(pROC)
library(grDevices)
library(fpc)
library(vip)
library(shiny)

Load the data set:
df <- read.csv('/Users/lianmingyu/Desktop/CITS4009 Computational Data Analysis/Project 1/Global YouTube Statistics.csv', header = TRUE)

First, as in the former project, the str() and summary() functions are used to take a first glance at the data:
## 'data.frame': 995 obs. of 28 variables:
## $ rank : int 1 2 3 4 5 6 7 8 9 10 ...
## $ Youtuber : chr "T-Series" "YouTube Movies" "MrBeast" "Cocomelon - Nursery Rhymes" ...
## $ subscribers : int 245000000 170000000 166000000 162000000 159000000 119000000 112000000 111000000 106000000 98900000 ...
## $ video.views : num 2.28e+11 0.00 2.84e+10 1.64e+11 1.48e+11 ...
## $ category : chr "Music" "Film & Animation" "Entertainment" "Education" ...
## $ Title : chr "T-Series" "youtubemovies" "MrBeast" "Cocomelon - Nursery Rhymes" ...
## $ uploads : int 20082 1 741 966 116536 0 1111 4716 493 574 ...
## $ Country : chr "India" "United States" "United States" "United States" ...
## $ Abbreviation : chr "IN" "US" "US" "US" ...
## $ channel_type : chr "Music" "Games" "Entertainment" "Education" ...
## $ video_views_rank : int 1 4055159 48 2 3 4057944 5 44 630 8 ...
## $ country_rank : num 1 7670 1 2 2 NaN 3 1 5 5 ...
## $ channel_type_rank : num 1 7423 1 1 2 ...
## $ video_views_for_the_last_30_days : num 2.26e+09 1.20e+01 1.35e+09 1.98e+09 1.82e+09 ...
## $ lowest_monthly_earnings : num 564600 0 337000 493800 455900 ...
## $ highest_monthly_earnings : num 9.0e+06 5.0e-02 5.4e+06 7.9e+06 7.3e+06 ...
## $ lowest_yearly_earnings : num 6.8e+06 4.0e-02 4.0e+06 5.9e+06 5.5e+06 ...
## $ highest_yearly_earnings : num 1.08e+08 5.80e-01 6.47e+07 9.48e+07 8.75e+07 ...
## $ subscribers_for_last_30_days : num 2e+06 NaN 8e+06 1e+06 1e+06 NaN NaN NaN 1e+05 6e+05 ...
## $ created_year : num 2006 2006 2012 2006 2006 ...
## $ created_month : chr "Mar" "Mar" "Feb" "Sep" ...
## $ created_date : num 13 5 20 1 20 24 12 29 14 23 ...
## $ Gross.tertiary.education.enrollment....: num 28.1 88.2 88.2 88.2 28.1 NaN 88.2 63.2 81.9 88.2 ...
## $ Population : num 1.37e+09 3.28e+08 3.28e+08 3.28e+08 1.37e+09 ...
## $ Unemployment.rate : num 5.36 14.7 14.7 14.7 5.36 NaN 14.7 2.29 4.59 14.7 ...
## $ Urban_population : num 4.71e+08 2.71e+08 2.71e+08 2.71e+08 4.71e+08 ...
## $ Latitude : num 20.6 37.1 37.1 37.1 20.6 ...
## $ Longitude : num 79 -95.7 -95.7 -95.7 79 ...
## rank Youtuber subscribers video.views
## Min. : 1.0 Length:995 Min. : 12300000 Min. :0.000e+00
## 1st Qu.:249.5 Class :character 1st Qu.: 14500000 1st Qu.:4.288e+09
## Median :498.0 Mode :character Median : 17700000 Median :7.761e+09
## Mean :498.0 Mean : 22982412 Mean :1.104e+10
## 3rd Qu.:746.5 3rd Qu.: 24600000 3rd Qu.:1.355e+10
## Max. :995.0 Max. :245000000 Max. :2.280e+11
##
## category Title uploads Country
## Length:995 Length:995 Min. : 0.0 Length:995
## Class :character Class :character 1st Qu.: 194.5 Class :character
## Mode :character Mode :character Median : 729.0 Mode :character
## Mean : 9187.1
## 3rd Qu.: 2667.5
## Max. :301308.0
##
## Abbreviation channel_type video_views_rank country_rank
## Length:995 Length:995 Min. : 1 Min. : 1.0
## Class :character Class :character 1st Qu.: 323 1st Qu.: 11.0
## Mode :character Mode :character Median : 916 Median : 51.0
## Mean : 554249 Mean : 386.1
## 3rd Qu.: 3584 3rd Qu.: 123.0
## Max. :4057944 Max. :7741.0
## NA's :1 NA's :116
## channel_type_rank video_views_for_the_last_30_days lowest_monthly_earnings
## Min. : 1.0 Min. :1.000e+00 Min. : 0
## 1st Qu.: 27.0 1st Qu.:2.014e+07 1st Qu.: 2700
## Median : 65.5 Median :6.408e+07 Median : 13300
## Mean : 745.7 Mean :1.756e+08 Mean : 36886
## 3rd Qu.: 139.8 3rd Qu.:1.688e+08 3rd Qu.: 37900
## Max. :7741.0 Max. :6.589e+09 Max. :850900
## NA's :33 NA's :56
## highest_monthly_earnings lowest_yearly_earnings highest_yearly_earnings
## Min. : 0 Min. : 0 Min. : 0
## 1st Qu.: 43500 1st Qu.: 32650 1st Qu.: 521750
## Median : 212700 Median : 159500 Median : 2600000
## Mean : 589808 Mean : 442257 Mean : 7081814
## 3rd Qu.: 606800 3rd Qu.: 455100 3rd Qu.: 7300000
## Max. :13600000 Max. :10200000 Max. :163400000
##
## subscribers_for_last_30_days created_year created_month created_date
## Min. : 1 Min. :1970 Length:995 Min. : 1.00
## 1st Qu.: 100000 1st Qu.:2009 Class :character 1st Qu.: 8.00
## Median : 200000 Median :2013 Mode :character Median :16.00
## Mean : 349079 Mean :2013 Mean :15.75
## 3rd Qu.: 400000 3rd Qu.:2016 3rd Qu.:23.00
## Max. :8000000 Max. :2022 Max. :31.00
## NA's :337 NA's :5 NA's :5
## Gross.tertiary.education.enrollment.... Population Unemployment.rate
## Min. : 7.60 Min. :2.025e+05 Min. : 0.750
## 1st Qu.: 36.30 1st Qu.:8.336e+07 1st Qu.: 5.270
## Median : 68.00 Median :3.282e+08 Median : 9.365
## Mean : 63.63 Mean :4.304e+08 Mean : 9.279
## 3rd Qu.: 88.20 3rd Qu.:3.282e+08 3rd Qu.:14.700
## Max. :113.10 Max. :1.398e+09 Max. :14.720
## NA's :123 NA's :123 NA's :123
## Urban_population Latitude Longitude
## Min. : 35588 Min. :-38.42 Min. :-172.10
## 1st Qu.: 55908316 1st Qu.: 20.59 1st Qu.: -95.71
## Median :270663028 Median : 37.09 Median : -51.93
## Mean :224214982 Mean : 26.63 Mean : -14.13
## 3rd Qu.:270663028 3rd Qu.: 37.09 3rd Qu.: 78.96
## Max. :842933962 Max. : 61.92 Max. : 138.25
## NA's :123 NA's :123 NA's :123
Based on this glance at the structure and summary, there are 28 variables in total: 4 integer, 17 numeric and 7 character variables, over 995 observations. Among these variables, the earnings-related variables will be transformed into a binary categorical variable serving as the target (known as Y) for classification; the other variables will be engineered and their relationships with the target variable analysed. In order to classify the income level, a GDP data set from the World Bank (2022) will be referred to.
In addition, from this glance and the understanding of the project, several assumptions are listed as follows:
The data cleaning, transformation and target-variable generation will be carried out with reference to the assumptions above.
This process is kept brief, as Project 1 has already explained it; the same applies to the data transformation part.
The following steps will be conducted:
#drop the redundant columns (set them to NULL)
df$rank <- NULL
df$Youtuber <- NULL
df$Title <- NULL
df$Abbreviation <- NULL
df$video_views_rank <- NULL
df$country_rank <- NULL
df$channel_type_rank <- NULL
df$channel_type <- NULL
#drop variables that relate to the country
df$Longitude <- NULL
df$Latitude <- NULL
df$Urban_population <- NULL
df$Unemployment.rate <- NULL
df$Population <- NULL
df$Gross.tertiary.education.enrollment.... <- NULL
#Convert nonsensical values to NA
#convert "NaN" and "nan" to NA for all columns
df <- mutate_all(df, function(x) ifelse(x == "NaN", NA, x))
df <- mutate_all(df, function(x) ifelse(x == "nan", NA, x))
#convert the 0s in "video.views" and "uploads" to NA
df <- mutate(df, video.views = na_if(video.views, 0))
df <- mutate(df, uploads = na_if(uploads, 0))
#Check the NA values in the data set
nullcheck <- function(df) {
  sapply(df, function(x) sum(is.na(x)))
}
nullcheck(df)
## subscribers video.views
## 0 8
## category uploads
## 46 43
## Country video_views_for_the_last_30_days
## 122 56
## lowest_monthly_earnings highest_monthly_earnings
## 0 0
## lowest_yearly_earnings highest_yearly_earnings
## 0 0
## subscribers_for_last_30_days created_year
## 337 5
## created_month created_date
## 5 5
#replace the null in "country" to "Missing"
df$Country <- ifelse(is.na(df$Country), "Missing", df$Country)
#replace the null in "category" to "Missing"
df$category <- ifelse(is.na(df$category), "Missing", df$category)
#Use the mode to replace the null values in the 3 date-related variables
year_mode <- as.numeric(names(sort(table(df$created_year), decreasing = TRUE)[1]))
df$created_year[is.na(df$created_year)] <- year_mode
month_mode <- names(sort(table(df$created_month), decreasing = TRUE)[1])
df$created_month[is.na(df$created_month)] <- month_mode
date_mode <- as.numeric(names(sort(table(df$created_date), decreasing = TRUE)[1]))
df$created_date[is.na(df$created_date)] <- date_mode
#find rows where the last-30-day viewers/subscribers are higher than the all-time viewers/subscribers
#these rows are nonsensical
sum(df$video_views_for_the_last_30_days > df$video.views, na.rm = TRUE)
## [1] 4
## [1] 0
#delete the nonsensical rows, and the rows where video views or last-30-day views are null,
#since the number of such rows is relatively low
df <- subset(df, !is.na(video_views_for_the_last_30_days))
df <- subset(df, video_views_for_the_last_30_days < video.views)
df <- subset(df, !is.na(df$video.views))
#use md.pattern() to check the nulls and their corresponding columns/rows directly
plot_obj <- md.pattern(df, plot = TRUE, rotate.names = TRUE)
## [1] 1
#from the plot, there are still 293 nulls in the subscribers_for_last_30_days column
#use vtreat for the remaining (numeric) variables
varlist <- colnames(df)
treatment_plan <- design_missingness_treatment(df, varlist = varlist)
df <- prepare(treatment_plan, df)
##     /\     /\
## { `---' }
## { O O }
## ==> V <== No need for mice. This data set is completely observed.
## \ \|/ /
## `-----'
## [1] 0
All nonsensical data has been cleaned and there are no missing values in the final data frame.
Data transformation is the process of converting data into the format/type that can be used in further modelling, prediction, clustering, etc. In this step, a common data transformation is applied to fit the general process. Some specific transformations (such as categorical-to-numeric conversion, needed only for clustering) and treatments will be applied just before the specific process, not in 1.3.
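As a preview of the clustering-only transformation mentioned above (categorical to numeric), a minimal sketch using base R's model.matrix(), which one-hot encodes factor levels; the toy data frame below is hypothetical and only illustrates the idea, it is not the project's df:

```r
# hypothetical toy column standing in for a categorical variable such as 'region'
toy <- data.frame(region = c("Europe", "Asia Pacific", "Europe", "Middle East"))
# model.matrix() one-hot encodes the factor levels ('0 +' drops the intercept)
onehot <- model.matrix(~ 0 + region, data = toy)
onehot
```

Each row of the result contains a single 1 marking that observation's level, which is the numeric form clustering distance measures can work with.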
This data transformation mainly contains the following steps; it is also kept brief, as the former project has explained it:
#Replace the yearly earnings with their average. Drop the monthly earnings, as Project 1 found a direct positive correlation between monthly and yearly earnings.
df$avg_yearlyearning <- (df$highest_yearly_earnings + df$lowest_yearly_earnings)/2
df$highest_monthly_earnings <- NULL
df$lowest_monthly_earnings <- NULL
df$highest_yearly_earnings <- NULL
df$lowest_yearly_earnings <- NULL
#combine the date-related variables and drop the used columns
numbermonth <- c("Jan" = 1, "Feb" = 2, "Mar" = 3, "Apr" = 4, "May" = 5, "Jun" = 6, "Jul" = 7, "Aug" = 8, "Sep" = 9, "Oct" = 10, "Nov" = 11, "Dec" = 12)
df$created_month <- numbermonth[df$created_month]
df$created_time <- as.Date(paste(df$created_year, df$created_month, df$created_date, sep = "-"))
#drop the columns that have already been transformed
df$created_year <- NULL
df$created_month <- NULL
df$created_date <- NULL
#to deliver the creation-date information to the prediction model, convert it to numeric
df$numeric_date <- as.numeric(df$created_time)
df$created_time <- NULL
#convert country to region
#select the top 30 countries by count
df %>%
  count(Country, sort = TRUE) %>%
  top_n(n = 30, wt = n)
## Country n
## 1 United States 301
## 2 India 167
## 3 Missing 87
## 4 Brazil 59
## 5 United Kingdom 43
## 6 Mexico 30
## 7 Indonesia 27
## 8 Spain 22
## 9 Thailand 18
## 10 South Korea 17
## 11 Russia 15
## 12 Argentina 13
## 13 Canada 13
## 14 Philippines 12
## 15 Colombia 11
## 16 Australia 8
## 17 Saudi Arabia 8
## 18 Ukraine 8
## 19 United Arab Emirates 7
## 20 Pakistan 6
## 21 France 5
## 22 Germany 5
## 23 Japan 5
## 24 Sweden 4
## 25 Turkey 4
## 26 Chile 3
## 27 Jordan 3
## 28 Netherlands 3
## 29 Vietnam 3
## 30 Ecuador 2
## 31 Egypt 2
## 32 Iraq 2
## 33 Italy 2
## 34 Singapore 2
#group the top 30 countries into regions
region_map <- data.frame(
  Country = c("United States","India","Brazil","United Kingdom","Mexico","Indonesia","Spain","Thailand","Russia","South Korea","Argentina","Canada","Colombia","Philippines","Australia","Saudi Arabia","Ukraine","France","Germany","Japan","Pakistan","United Arab Emirates","Chile","Jordan","Netherlands","Sweden","Turkey","Vietnam","Ecuador","Egypt"),
  region = c("North America","South Asia","South America","Europe","South America","Asia Pacific","Europe","Asia Pacific","Europe","Asia Pacific","South America","North America","South America","Asia Pacific","Asia Pacific","Middle East","Europe","Europe","Europe","Asia Pacific","South Asia","Middle East","South America","Middle East","Europe","Europe","Middle East","Asia Pacific","South America","Middle East")
)
df <- df %>%
  left_join(region_map, by = "Country") %>%
  mutate(region = ifelse(is.na(region), "Others", region))
#convert category to a more general level
#list all the levels of category; there are 19 levels
unique(df$category)
## [1] "Music" "Entertainment" "Education"
## [4] "Shows" "People & Blogs" "Gaming"
## [7] "Sports" "Film & Animation" "Howto & Style"
## [10] "News & Politics" "Comedy" "Trailers"
## [13] "Nonprofits & Activism" "Missing" "Science & Technology"
## [16] "Movies" "Pets & Animals" "Autos & Vehicles"
## [19] "Travel & Events"
#map them into more general categories to reduce the number of levels, as many levels over a small number of observations may cause the model to get stuck; after the mapping there are 14 levels
categorymap <- data.frame(
  category = c("Howto & Style","Entertainment","Music","Comedy","Gaming","Film & Animation","Education","Sports","News & Politics","People & Blogs","Nonprofits & Activism","Autos & Vehicles","Science & Technology","Shows","Trailers","Missing","Movies","Travel & Events","Pets & Animals"),
  newcate = c("Howto & Style","Entertainment","Music","Entertainment","Gaming","Film & Animation","Education","Sports","News & Politics","People & Blogs","Nonprofits & Activism","Autos & Vehicles","Education","Entertainment","Entertainment","Missing","Film & Animation","Entertainment","Pets & Animals")
)
df <- df %>%
  left_join(categorymap, by = "category") %>%
  mutate(newcate = ifelse(is.na(newcate), "Others", newcate))

In this process, the yearly income data will be converted into a binary categorical variable (high income vs low income) for classification. The income will be normalised using 2022 GDP data, referred to and merged from the World Bank. As the original GDP data is imperfect and contains several missing values, it will also be processed. The main steps are:
#import original GDP data
gdp <- read.csv('/Users/lianmingyu/Downloads/API_NY/API_NY.GDP.PCAP.CD_DS2_en_csv_v2_5871588.csv', header = TRUE, skip = 4)
#GDP data cleaning and transformation
#several country names mismatch between the two data sets; rename them
#convert Korea into South Korea
gdp$Country.Name[gdp$Country.Name == "Korea, Rep."] <- "South Korea"
#convert Russian Federation into Russia
gdp$Country.Name[gdp$Country.Name == "Russian Federation"] <- "Russia"
#convert Turkiye into Turkey
gdp$Country.Name[gdp$Country.Name == "Turkiye"] <- "Turkey"
#convert venezuela, rb into Venezuela
gdp$Country.Name[gdp$Country.Name == "Venezuela, RB"] <- "Venezuela"
#convert Egypt, Arab Rep. into Egypt
gdp$Country.Name[gdp$Country.Name == "Egypt, Arab Rep."] <- "Egypt"
#Drop the unneeded columns and keep only the country and the 2022 GDP
exclude_columns <- names(gdp)[2:48]
columns_to_impute <- c( "X2022")
#Fill the missing values
#Use the mice function: analyse the historical GDP data and predict the 2022 values to fill the missing 2022 GDP figures (using the random forest method).
miceMod <- mice(gdp[, !names(gdp) %in% exclude_columns], method = "rf")
miceOutput <- complete(miceMod)

After this simple processing of the GDP data, it can be merged with df.
#merge the two data sets and rename the 2022 GDP column to 'gdp'
gdp <- miceOutput[c("Country.Name", "X2022")]
df <- merge(df, gdp, by.x = "Country", by.y = "Country.Name", all.x = TRUE)
df <- rename(df, gdp = X2022)
#after merging, the gdp will be null for the rows whose country is 'Missing'
#Use the median to fill the null values in gdp
df$gdp[is.na(df$gdp)] <- median(df$gdp, na.rm = TRUE)
#Check the null values of the merged data frame
plot_obj <- md.pattern(df, plot = TRUE, rotate.names = TRUE)
##     /\     /\
## { `---' }
## { O O }
## ==> V <== No need for mice. This data set is completely observed.
## \ \|/ /
## `-----'
## [1] 0
#normalise the income by GDP per capita
df$normalizeincome <- df$avg_yearlyearning/df$gdp
#derive income_class (the target variable) from the comparison with GDP: if the income/GDP ratio is below the median, it is set to 0, otherwise 1
df$income_class <- ifelse(df$normalizeincome < median(df$normalizeincome), 0, 1)
#finally, delete the avg_yearlyearning, normalizeincome and gdp columns, as they have already contributed to generating the target variable
#besides, as discussed, 'Country' will also be dropped, since its large number of levels may make further model development inefficient or stuck; the country information is reflected in the 'region' variable
df$normalizeincome <- NULL
df$avg_yearlyearning <- NULL
df$gdp <- NULL
df$Country <- NULL
##
## 0 1
## 466 466
##
## 0 1
## 0.5 0.5
After this series of data processing, the noisy data and redundant variables have been removed, and the target variable has been determined and generated. 9 variables are kept, as they are suitable for conducting classification or clustering.
The target variable indicates whether a YouTuber has a higher or lower income compared with the average income in their own country. In addition, the two binary levels are balanced half and half, which suits the subsequent classification process.
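The half-and-half claim can be checked with a one-line frequency table; this sketch uses a hypothetical 0/1 vector standing in for df$income_class (466 of each level, matching the output above):

```r
# hypothetical stand-in for df$income_class: 466 zeros and 466 ones
y <- rep(c(0, 1), times = c(466, 466))
table(y)               # counts per class
prop.table(table(y))   # proportions per class
```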
Classification is a supervised process of building models to find the relationship between independent variables and a dependent variable. In this report, 9 variables will be used to predict a binary categorical target variable called 'income_class', where high income is shown as 1 and low income as 0.
Several questions on classification are raised here, which will guide the entire classification process:
Based on the questions raised above, the main steps in the classification phase are listed here:
In this process, the data will first be split into two parts: dTrainAll and dTest. Then dTrainAll will be split into dTrain (the final training set) and dCal (the calibration set). The proportions are always 80% and 20% respectively.
In the feature selection and model development steps, dTrain is used to train models and dCal to assess them within each process. Only in the final step is dTest used on the final 3 variable combinations * 3 models to evaluate which setting performs best. However, as the original data has limited observations, high variance (over-fitting) is likely to occur, so a method called cross-validation will be introduced in the practical process.
In addition, the variables will be grouped by type in order to fit the model algorithms.
#split to all train data and test data
set.seed(40094009)
df$rgroup <- runif(dim(df)[1])
dTrainAll <- subset(df, rgroup<=0.8)
dTest <- subset(df, rgroup>0.8)
#group the variables into numeric and category based on the type, also define the outcome variable
valid_columns <- setdiff(colnames(dTrainAll), c('income_class', 'rgroup'))
catvars <- valid_columns[sapply(dTrainAll[, valid_columns], class) %in%
c('factor', 'character')]
numvars <- valid_columns[sapply(dTrainAll[, valid_columns], class) %in%
c('numeric', 'integer')]
outcome <- c("income_class")
pos <- '1'
#split the dTrainAll into training set and validation (calibration) set
useForCal <- rbinom(n=dim(dTrainAll)[1], size=1, prob=0.2)>0
dCal <- subset(dTrainAll, useForCal)
dTrain <- subset(dTrainAll, !useForCal)
#delete rgroup as it is useless after splitting
df$rgroup <- NULL

Feature selection is an essential step of model building, which helps to enhance model performance by removing the variables that are not correlated with the outcome. It keeps the model away from the impact of noise and deviating elements. This project aims to find the best variable combinations for generating the best classification model.
Three feature selection methods are used in the following code for the later multi-variable modelling part: the single-variable method (AUC and log-likelihood), the Boruta algorithm, and the combination of correlation coefficient and chi-square tests. Although the selected variables are mostly the same, the performance differs slightly during evaluation.
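The combination method named above appears later in the report; as a sketch of the two tests it relies on, the base-R calls below run a chi-square test of independence for a categorical predictor and a Pearson correlation for a numeric one, on simulated data (not the project's df):

```r
set.seed(1)
# simulated data: a categorical predictor, a numeric predictor and a binary outcome
cat_var <- sample(c("A", "B"), 200, replace = TRUE)
y <- ifelse(cat_var == "A", rbinom(200, 1, 0.7), rbinom(200, 1, 0.3))
num_var <- y + rnorm(200)

# chi-square test of independence between the categorical predictor and the outcome
chi <- chisq.test(table(cat_var, y))
# Pearson correlation between the numeric predictor and the outcome
r <- cor(num_var, y)

c(chi_p_value = chi$p.value, correlation = r)
```

A small chi-square p-value and a correlation far from zero both indicate the predictor carries information about the outcome, which is the selection criterion this method applies.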
The single-variable method finds well-performing individual variables by evaluating each independent variable on its own. In other words, each single variable is used to build a model, yielding as many single-variable models as there are variables; each model is then evaluated to find the well-performing ones. The variables corresponding to the well-performing models are selected as features for classification.
In this project, ROC/AUC charts and log-likelihood will be applied to evaluate how each variable performs (that is, the extent of impact each variable has on the target variable, income_class). Through this process, a combination of features is selected based on these evaluation methods.
Besides, as the number of observations is relatively low (only around 1000), there is a high chance of building an over-fitting model. Therefore, cross-validation is practised here to avoid selecting an over-fitting model or variable.
As single-variable selection is complex, the project proceeds in the following parts:
#set the single variable model function
mkPredC <- function(outCol,varCol,appCol) {
pPos <- sum(outCol==pos)/length(outCol)
naTab <- table(as.factor(outCol[is.na(varCol)]))
pPosWna <- (naTab/sum(naTab))[pos]
vTab <- table(as.factor(outCol),varCol)
pPosWv <- (vTab[pos,]+1.0e-3*pPos)/(colSums(vTab)+1.0e-3)
pred <- pPosWv[appCol]
pred[is.na(appCol)] <- pPosWna
pred[is.na(pred)] <- pPos
pred
}
#Set up an AUC function to evaluate the categorical variables' performance
calcAUC <- function(predcol, outcol) {
perf <- performance(prediction(predcol, outcol==pos),'auc')
as.numeric(perf@y.values)
}
#rank the categorical variables by AUC (the null model has an AUC of 0.5, so extract the variables that perform better than the null model, i.e. AUC >= 0.5)
for (x in catvars) {
pii <- paste('pred', x, sep = '')
aucTrain <- calcAUC(dTrain[, pii], dTrain[, outcome])
if (aucTrain >= 0.50) {
aucCal <- force(calcAUC(dCal[, pii], dCal[, outcome]))
print(sprintf(
"%s, trainAUC: %4.3f calibrationAUC: %4.3f",
pii, aucTrain, aucCal))
}
}
## [1] "predregion, trainAUC: 0.808 calibrationAUC: 0.862"
## [1] "predcategory, trainAUC: 0.628 calibrationAUC: 0.537"
A null model will be set up and its AUC checked: if any single-variable model performs worse than the null model, it will be treated as meaningless, since a useful variable should perform better than the null model:
## [1] 330
## Proportion of outcome == 1 in dTrain: 0.5454545
TP <- 0; TN <- sum(dCal[,outcome] == -1)
FP <- 0; FN <- sum(dCal[,outcome] == 1)
cat("nrow(dCal):", nrow(dCal), "TP:", TP, "TN:", TN, "FP:", FP, "FN:", FN)
## nrow(dCal): 143 TP: 0 TN: 0 FP: 0 FN: 59
## [1] 0
## [1] NaN
## [1] 0
## [1] 0.5
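The 0.5 above can be verified directly: a constant score, as a null model produces, cannot separate the classes. This sketch computes AUC via the rank-sum (Mann-Whitney) formulation in base R, on hypothetical labels (84 negatives, 59 positives):

```r
# AUC via the rank-sum (Mann-Whitney) formulation
auc_rank <- function(score, label) {
  r <- rank(score)   # mid-ranks handle the all-tied null-model case
  n_pos <- sum(label == 1)
  n_neg <- sum(label == 0)
  (sum(r[label == 1]) - n_pos * (n_pos + 1) / 2) / (n_pos * n_neg)
}
# hypothetical labels; the null model assigns every observation the same score
label <- rep(c(0, 1), times = c(84, 59))
null_score <- rep(0.5, length(label))
auc_rank(null_score, label)   # 0.5
```

A perfect scorer, by contrast, would give auc_rank(label, label) = 1, which is why AUC >= 0.5 is the bar each single variable must clear.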
#Conduct model for numeric variables (similar as category)
mkPredN <- function(outCol,varCol,appCol) {
cuts <- unique(as.numeric(quantile(varCol, probs=seq(0, 1, 0.1), na.rm=T)))
varC <- cut(varCol, cuts)
appC <- cut(appCol, cuts)
mkPredC(outCol, varC, appC)
}
for(v in numvars) {
pii<-paste('pred',v,sep='')
dTrain[,pii] <- mkPredN(dTrain[,outcome], dTrain[,v], dTrain[,v])
dTest[,pii] <- mkPredN(dTrain[,outcome], dTrain[,v], dTest[,v])
dCal[,pii] <- mkPredN(dTrain[,outcome], dTrain[,v], dCal[,v])
aucTrain <- calcAUC(dTrain[,pii],dTrain[,outcome])
if(aucTrain>=0.5) {
aucCal<-calcAUC(dCal[,pii],dCal[,outcome])
print(sprintf(
"%s, trainAUC: %4.3f calibrationAUC: %4.3f",
pii,aucTrain,aucCal))
}
}
## [1] "predsubscribers, trainAUC: 0.595 calibrationAUC: 0.578"
## [1] "predviews, trainAUC: 0.646 calibrationAUC: 0.552"
## [1] "preduploads, trainAUC: 0.684 calibrationAUC: 0.709"
## [1] "predlast_month_view, trainAUC: 0.856 calibrationAUC: 0.768"
## [1] "predlast_month_subscribe, trainAUC: 0.734 calibrationAUC: 0.726"
## [1] "predsubscribers_for_last_30_days_isBAD, trainAUC: 0.695 calibrationAUC: 0.682"
## [1] "prednumeric_date, trainAUC: 0.621 calibrationAUC: 0.663"
ROC curves will also be plotted to view the performance directly; the closer the curve is to the top-left corner, the higher the AUC, which means the variable predicts the target variable better:
plot_roc <- function(predcol, outcol, colour_id=2, overlaid=F){
ROCit_obj <- rocit(score=predcol, class=outcol==pos)
par(new=overlaid)
plot(ROCit_obj,
col = c(colour_id, 1), legend = FALSE, YIndex = FALSE, values = FALSE)
}
plot_roc(dCal$predregion, dCal[,outcome],colour_id=1)
plot_roc(dCal$predlast_month_view, dCal[,outcome], colour_id=2, overlaid=T)
plot_roc(dCal$preduploads, dCal[,outcome],colour_id=3,overlaid=T)
plot_roc(dCal$predlast_month_subscribe, dCal[,outcome], colour_id=4, overlaid=T)
# Add a legend
legend("topright", legend = c("region", "lst_mth_view", "uploads", "lst_mth_subscribe"), col = 1:4, lty = 1)

Based on the ROC evaluation (numbers and charts) for the categorical and numeric variables, several variables perform well on the calibration data's AUC/ROC: region among the categorical variables, and last_month_subscribe and last_month_view among the numeric ones.
Then the log-likelihood method will be conducted to select variables and to see whether it obtains the same results:
#log-likelihood
#first calculate the log-likelihood of the null model; a single-variable model's log-likelihood should be higher than the null model's
logLikelihood <- function(ytrue, ypred, epsilon=1e-6) {
  sum(ifelse(ytrue==pos, log(ypred+epsilon), log(1-ypred-epsilon)), na.rm=T)
}
#Compute the likelihood of the null model on the calibration set
outcome <- 'income_class'
logNull <- logLikelihood(dCal[,outcome], sum(dCal[,outcome]==pos)/nrow(dCal))
cat(logNull)
## -96.92346
#run through the categorical variables and calculate how much each variable improves on the null model log-likelihood; the higher the score, the better the performance
selCatVars <- c()
minDrop <- 10
for (v in catvars) {
  pi <- paste('pred', v, sep='')
  devDrop <- 2*(logLikelihood(dCal[,outcome], dCal[,pi]) - logNull)
  if (devDrop >= minDrop) {
    print(sprintf("%s, deviance reduction: %g", pi, devDrop))
    selCatVars <- c(selCatVars, pi)
  }
}
## [1] "predregion, deviance reduction: 65.8983"
#run through the numeric variables in the same way
selNumVars <- c()
minDrop <- 10  # may need to adjust this number
for (v in numvars) {
  pi <- paste('pred', v, sep='')
  devDrop <- 2*(logLikelihood(dCal[,outcome], dCal[,pi]) - logNull)
  if (devDrop >= minDrop) {
    print(sprintf("%s, deviance reduction: %g", pi, devDrop))
    selNumVars <- c(selNumVars, pi)
  }
}
## [1] "preduploads, deviance reduction: 10.3827"
## [1] "predlast_month_view, deviance reduction: 41.4714"
## [1] "predlast_month_subscribe, deviance reduction: 23.2305"
## [1] "predsubscribers_for_last_30_days_isBAD, deviance reduction: 13.8241"
## [1] "preduploads"
## [2] "predlast_month_view"
## [3] "predlast_month_subscribe"
## [4] "predsubscribers_for_last_30_days_isBAD"
## [1] "predregion"
In conclusion, in the first phase of the single-variable method, the ROC/AUC evaluation and the log-likelihood method produce similar combinations of variables.
However, as mentioned above, this method may suffer from over-fitting due to the limited number of observations. Therefore, cross-validation is introduced here to check the average AUC and its deviation, based on which the final feature selection can be conducted while preventing over-fitting.
Cross-validation is a method that splits the data randomly a specified number of times and, for each split, calculates the variable's performance; the overall performance of each single variable is then computed.
This method can detect over-fitting single variables effectively: compare the average AUC with the original AUC; if the difference is large, the first round of single-variable model building may have produced unstable performance. Otherwise, it shows that the models from the different splits perform similarly, with less chance of triggering the over-fitting problem.
#cross-validation check
vars <- c('region', 'last_month_subscribe', 'last_month_view', 'uploads', "subscribers_for_last_30_days_isBAD")
for (var in vars) {
  aucs <- rep(0, 100)
  for (rep in 1:length(aucs)) {
    useForCalRep <- rbinom(n=nrow(dTrainAll), size=1, prob=0.1) > 0
    predRep <- mkPredC(dTrainAll[!useForCalRep, outcome],
                       dTrainAll[!useForCalRep, var],
                       dTrainAll[useForCalRep, var])
    aucs[rep] <- calcAUC(predRep, dTrainAll[useForCalRep, outcome])
  }
  print(sprintf("%s: mean: %4.3f; sd: %4.3f", var, mean(aucs), sd(aucs)))
}
## [1] "region: mean: 0.814; sd: 0.052"
## [1] "last_month_subscribe: mean: 0.527; sd: 0.018"
## [1] "last_month_view: mean: 0.556; sd: 0.030"
## [1] "uploads: mean: 0.577; sd: 0.062"
## [1] "subscribers_for_last_30_days_isBAD: mean: 0.662; sd: 0.061"
Based on the cross-validation, the variables ‘subscribers_for_last_30_days_isBAD’ and ‘uploads’ show a relatively large gap from their original AUC, as well as a relatively high standard deviation, which implies their performance is unstable. Therefore, although both variables performed decently in the first round, they may introduce latent over-fitting, so it is better not to include them in the selected feature combination.
In this step, double density plots will be drawn to check the distribution of the three selected variables (region, last_month_subscribe, last_month_view).
## Factor w/ 7 levels "Asia Pacific",..: 5 6 6 6 6 6 6 6 6 6 ...
## Factor w/ 42 levels "1","3","4","5",..: 20 21 20 17 23 34 16 23 16 17 ...
## Factor w/ 588 levels "1","2","3","4",..: 19 488 282 441 520 439 297 553 163 269 ...
fig1 <- ggplot(dCal) + geom_density(aes(x=predregion, color=as.factor(income_class))) + labs(title = "Double Density Plot for Region")
fig2 <- ggplot(dCal) + geom_density(aes(x=predlast_month_subscribe, color=as.factor(income_class))) + labs(title = "Double Density Plot for Last Month Subscribe")
fig3 <- ggplot(dCal) + geom_density(aes(x=predlast_month_view, color=as.factor(income_class))) + labs(title = "Double Density Plot for Last Month View")
From the three charts above, all three variables show different distributions when income_class differs:
Therefore, all three selected variables can be delivered to the next step, as each of them has a distinct impact on the target variable.
In summary, through the single-variable (model) analysis, around 5 variables were selected by the ROC and log-likelihood evaluations. The cross-validation method then filtered out 2 of them to prevent potential over-fitting. The remaining 3 variables are kept, and they show strong performance in the double density plots.
#clarify the final selection of the features (Combination 1)
selNumVars <- c("predregion")
selCatVars <- c("predlast_month_subscribe", "predlast_month_view")
selVars <- c(selNumVars, selCatVars)
selVars
## [1] "predregion"               "predlast_month_subscribe"
## [3] "predlast_month_view"
The Boruta algorithm works as a wrapper around random forest. It adds randomness to the given data set by creating shadow features (shuffled copies of the real features). After a random forest classifier is trained, it evaluates each feature's importance and selects the features whose importance is higher than that of the best shadow feature (after several iterations it produces a result in which every feature is either confirmed or rejected).
The code below feeds the features into the Boruta algorithm to evaluate them and select the well-performing ones. The status of every feature can be checked in the statistics printed below.
df_bo <- dTrain %>%
  select(all_of(c(catvars, numvars, outcome)))
Bo <- Boruta(income_class~., data = df_bo, pValue = 0.01, doTrace=2, maxRuns = 20)
##                                      meanImp medianImp    minImp    maxImp
## region 50.047338 49.9624328 47.470157 53.270337
## category 0.687098 0.3865762 -1.196362 3.325358
## subscribers 2.316827 2.3045817 1.215587 3.958198
## views 11.927217 11.9736820 10.359804 13.483796
## uploads 17.488295 17.3985283 16.120871 18.858317
## last_month_view 55.568879 55.3204469 52.359099 58.703504
## last_month_subscribe 16.054622 15.9532570 14.197869 17.332215
## subscribers_for_last_30_days_isBAD 15.538126 15.2532132 14.088506 16.807978
## numeric_date 9.674731 9.5487433 7.894443 12.388671
## normHits decision
## region 1.0000000 Confirmed
## category 0.1578947 Tentative
## subscribers 0.3684211 Tentative
## views 1.0000000 Confirmed
## uploads 1.0000000 Confirmed
## last_month_view 1.0000000 Confirmed
## last_month_subscribe 1.0000000 Confirmed
## subscribers_for_last_30_days_isBAD 1.0000000 Confirmed
## numeric_date 1.0000000 Confirmed
A plot is generated to show how the features performed. As the plot below makes obvious, “region” and “last_month_view” have distinctive performance, while “category” and “subscribers” are dropped because their importance is lower than that of the best shadow feature (they are left Tentative in the table above).
# generate the plot showing the feature importance
plot(Bo, xlab = "", xaxt = "n")
lz <- lapply(1:ncol(Bo$ImpHistory), function(i)
  Bo$ImpHistory[is.finite(Bo$ImpHistory[,i]), i])
names(lz) <- colnames(Bo$ImpHistory)
Labels <- sort(sapply(lz, median))
axis(side = 1, las=2, labels = names(Labels),
     at = 1:ncol(Bo$ImpHistory), cex.axis = 0.7)
In conclusion, the Boruta algorithm selected the 7 features listed under “selVar2” below.
## [1] "region" "views"
## [3] "uploads" "last_month_view"
## [5] "last_month_subscribe" "subscribers_for_last_30_days_isBAD"
## [7] "numeric_date"
Two evaluation methods will be combined: the chi-squared test for categorical features, and correlation analysis for numerical features. Together they deliver a set of selected features.
First, apply the chi-squared test to the categorical features:
##
## Pearson's Chi-squared test
##
## data: dTrain$category and dTrain$income_class
## X-squared = 41.622, df = 12, p-value = 3.855e-05
##
## Pearson's Chi-squared test
##
## data: dTrain$region and dTrain$income_class
## X-squared = 191.69, df = 6, p-value < 2.2e-16
From the statistics above, “region” performs significantly well in the chi-squared test and will be added to the following process. “category” is abandoned due to its comparatively higher p-value and lower X-squared score.
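For reference, the chi-squared calls that produced the output above follow this pattern; a self-contained toy sketch in base R (the `region` and `income_class` vectors here are synthetic stand-ins, not the report's dTrain columns):

```r
# Synthetic stand-ins (NOT the report's data), built to show the call pattern
region <- factor(rep(c("Europe", "Asia Pacific"), each = 50))
income_class <- factor(c(rep(1, 40), rep(0, 10), rep(1, 10), rep(0, 40)))

# Pearson's chi-squared test of independence between a categorical
# feature and the target, as applied to dTrain$region above
res <- chisq.test(region, income_class)
res$statistic  # X-squared value
res$p.value    # a small p-value means the feature is associated with the target
```

In the report, the same call is made with `dTrain$category` and `dTrain$region` against `dTrain$income_class`.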
Then, apply the correlation coefficient to the numerical features:
#correlation coefficient function
df_num <- dTrain %>%
  select(all_of(c(numvars, outcome)))
chart.Correlation(df_num, histogram=TRUE, pch=19)
The plot above shows the performance of each numerical variable. According to the plot, almost every feature performs well enough for further use except “subscribers_for_last_30_days_isBAD”. The chi-squared test, in turn, shows that region is the best categorical feature, with strong performance and a very low p-value. In addition, the correlation between views and subscribers is very high (likewise between the last_month features); however, the values are still reasonable, so these features are kept for further use.
## [1] "subscribers" "views"
## [3] "uploads" "last_month_view"
## [5] "last_month_subscribe" "subscribers_for_last_30_days_isBAD"
## [7] "numeric_date" "region"
Finally, 8 features are selected in this stage by these two combined methods, to be used in the multi-variable models.
The information below shows the results of the 3 feature selection methods. The log-likelihood method selected 3 features: “last_month_view”, “last_month_subscribe” and “region”. The Boruta algorithm expelled “category” and “subscribers”, retaining 7 variables including “views”, “uploads”, “subscribers_for_last_30_days_isBAD” and “numeric_date”. The chi-squared test combined with the correlation coefficient selected one additional feature, “subscribers”, compared with the second method.
## [1] "region" "last_month_subscribe" "last_month_view"
## [1] "region" "views"
## [3] "uploads" "last_month_view"
## [5] "last_month_subscribe" "subscribers_for_last_30_days_isBAD"
## [7] "numeric_date"
## [1] "subscribers" "views"
## [3] "uploads" "last_month_view"
## [5] "last_month_subscribe" "subscribers_for_last_30_days_isBAD"
## [7] "numeric_date" "region"
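The vectors above were assembled by code that was not echoed. For the Boruta combination, the package provides a helper that extracts the confirmed features; a self-contained toy run on iris sketches the pattern (the report's `Bo` object would be passed in the same way):

```r
library(Boruta)  # assumed installed, as in the Boruta section above
set.seed(1)
toy <- Boruta(Species ~ ., data = iris, maxRuns = 20)
# getSelectedAttributes() returns the Confirmed features -- this is how a
# vector such as selVar2 can be pulled out of a Boruta run
getSelectedAttributes(toy, withTentative = FALSE)
```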
After the feature selection in the previous phases, there are 3 variable combinations that can be used in model generation. However, the classifiers have not yet been decided. Considering the data types and the number of observations, three classifiers will be used for training the multi-variable models:
In this part, the 3 selected feature combinations will be fed into the 3 classifiers and evaluated in various ways. For each classifier, the best of the 3 variable combinations will be selected, and each of these models (3 classifiers, each with its best variable combination) will be interpreted by a plot or by LIME. In the end, these models are delivered to the final evaluation.
Before any processing, the training, calibration and test data are refreshed for the 3 data sets.
# arrange the 3 data sets (simple column selection)
dTrain1 <- dTrain %>%
select(all_of(c(selVars, outcome)))
dTrain2 <- dTrain %>%
select(all_of(c(selVar2, outcome)))
dTrain3 <- dTrain %>%
select(all_of(c(selVar3, outcome)))
dTest1 <- dTest %>%
select(all_of(c(selVars, outcome)))
dTest2 <- dTest %>%
select(all_of(c(selVar2, outcome)))
dTest3 <- dTest %>%
select(all_of(c(selVar3, outcome)))
dCal1 <- dCal %>%
select(all_of(c(selVars, outcome)))
dCal2 <- dCal %>%
select(all_of(c(selVar2, outcome)))
dCal3 <- dCal %>%
select(all_of(c(selVar3, outcome)))
The Decision Tree model is a supervised learning method for classification that predicts values by applying decision rules. A decision tree can be explained directly from its plot, so there is no need for a separate interpretation model.
The Decision Tree modelling process is divided into the following parts:
# Decision Tree First Evaluation
(fV <- paste(outcome,'> 0 ~ ', paste(selVars, collapse=' + '), sep=''))
## [1] "income_class> 0 ~ region + last_month_subscribe + last_month_view"
t1model <- rpart(fV, data=dTrain1)
print(calcAUC(predict(t1model, newdata=dTrain1), dTrain1[,outcome]))
## [1] 0.9630854
## [1] 0.9291176
## [1] 0.9763923
# Decision Tree Second Evaluation
(fV <- paste(outcome,'> 0 ~ ', paste(selVar2, collapse=' + '), sep=''))
## [1] "income_class> 0 ~ region + views + uploads + last_month_view + last_month_subscribe + subscribers_for_last_30_days_isBAD + numeric_date"
t2model <- rpart(fV, data=dTrain2)
print(calcAUC(predict(t2model, newdata=dTrain2), dTrain2[,outcome]))
## [1] 0.9730744
## [1] 0.9359752
## [1] 0.9867837
# Decision Tree Third Evaluation
(fV <- paste(outcome,'> 0 ~ ', paste(selVar3, collapse=' + '), sep=''))
## [1] "income_class> 0 ~ subscribers + views + uploads + last_month_view + last_month_subscribe + subscribers_for_last_30_days_isBAD + numeric_date + region"
t3model <- rpart(fV, data=dTrain3)
print(calcAUC(predict(t3model, newdata=dTrain3), dTrain3[,outcome]))
## [1] 0.9730744
## [1] 0.9359752
## [1] 0.9867837
From the AUC values calculated above, variable combinations 2 and 3 give identical AUCs, and all 3 models score almost the same, which raises the question of why this happens. One hypothesis is that the model ignores some of the supplied features and keeps only the useful ones, so the extra features make no difference to the result. This will be explored in the next part.
This step will evaluate the models using the ROC curve and performance metrics.
#ROC Plot
plot_roc <- function(predcol1, outcol1, predcol2, outcol2,predcol3, outcol3)
{ roc_1 <- rocit(score=predcol1, class=outcol1==pos)
roc_2 <- rocit(score=predcol2, class=outcol2==pos)
roc_3 <- rocit(score=predcol3, class=outcol3==pos)
plot(roc_1, col = c("blue","green"), lwd = 3, legend = FALSE,YIndex = FALSE, values = TRUE, asp=1)
lines(roc_2$TPR ~ roc_2$FPR, lwd = 3, col = c("red","green"), asp=1)
lines(roc_3$TPR ~ roc_3$FPR, lwd = 3, col = c("yellow","green"), asp=1)
legend("bottomright", col = c("blue","red","yellow", "green"), c("combination 1","combination 2", "combination 3", "Null Model"), lwd = 2)
}
pred_model3_roc <- predict(t3model, newdata=dTest3)
pred_model2_roc <- predict(t2model, newdata=dTest2)
pred_model1_roc <- predict(t1model, newdata=dTest1)
plot_roc(pred_model1_roc, dTest1[[outcome]],
         pred_model2_roc, dTest2[[outcome]],
         pred_model3_roc, dTest3[[outcome]])
The plot shows that t2model and t3model have similar performance, and both are better than t1model, which matches the AUC values from the previous part.
# performance metrics
logLikelihood <- function(ytrue, ypred, epsilon=1e-6)
{ sum(ifelse(ytrue, log(ypred+epsilon), log(1-ypred+epsilon)), na.rm=T) }
performanceMeasures <- function(ytrue, ypred, model.name = "model", threshold=0.5)
{ # compute the normalised deviance
dev.norm <- -2 * logLikelihood(ytrue, ypred)/length(ypred)
# compute the confusion matrix
cmat <- table(actual = ytrue, predicted = ypred >= threshold)
accuracy <- sum(diag(cmat)) / sum(cmat)
precision <- cmat[2, 2] / sum(cmat[, 2])
recall <- cmat[2, 2] / sum(cmat[2, ])
f1 <- 2 * precision * recall / (precision + recall)
data.frame(model = model.name,
           precision = precision, recall = recall, f1 = f1, dev.norm = dev.norm) }
# evaluation function
panderOpt <- function(){
panderOptions("plain.ascii", TRUE)
panderOptions("keep.trailing.zeros", TRUE)
panderOptions("table.style", "simple") }
pretty_perf_table <- function(model, xtrain, ytrain, xtest, ytest, threshold=0.5) {
# Option setting for Pander
panderOpt()
perf_justify <- "lrrrr"
# call the predict() function to do the predictions
pred_train <- predict(model, newdata=xtrain)
pred_test <- predict(model, newdata=xtest)
# comparing performance on training vs. test
trainperf_df <- performanceMeasures( ytrain, pred_train, model.name="training", threshold=threshold)
testperf_df <- performanceMeasures( ytest, pred_test, model.name="test", threshold=threshold)
# combine the two performance data frames using rbind()
perftable <- rbind(trainperf_df, testperf_df)
pandoc.table(perftable, justify = perf_justify)
}
#draw the matrix
pretty_perf_table(t1model, dTrain1[selVars], dTrain1[,outcome]==pos, dTest1[selVars], dTest1[,outcome]==pos)
##
##
## model precision recall f1 dev.norm
## ---------- ----------- -------- -------- ----------
## training 0.9357 0.9697 0.9524 0.3798
## test 0.8571 0.9351 0.8944 0.5972
pretty_perf_table(t2model, dTrain2[selVar2], dTrain2[,outcome]==pos, dTest2[selVar2], dTest2[,outcome]==pos)
##
##
## model precision recall f1 dev.norm
## ---------- ----------- -------- -------- ----------
## training 0.9357 0.9697 0.9524 0.3388
## test 0.8571 0.9351 0.8944 0.5892
pretty_perf_table(t3model, dTrain3[selVar3], dTrain3[,outcome]==pos, dTest3[selVar3], dTest3[,outcome]==pos)
##
##
## model precision recall f1 dev.norm
## ---------- ----------- -------- -------- ----------
## training 0.9357 0.9697 0.9524 0.3388
## test 0.8571 0.9351 0.8944 0.5892
The results are consistent with the AUC scores and the ROC plot: precision, recall and F1 are all around 90%, which is an excellent performance. The normalised deviance is also very low, with models 2 and 3 achieving slightly lower (better) values.
Obviously, “region”, “last_month_view” and “subscribers_for_last_30_days_isBAD” play significant roles in the tree graphs. Moreover, the 3 trees are almost identical, which supports the hypothesis above: only 2 to 3 features are actually used by the decision tree model. The slightly better performance of models 2 and 3 comes from the one difference between the trees (the use of “subscribers_for_last_30_days_isBAD”), which suggests the model becomes more accurate when missingness is recognised.
Therefore, using the 2nd and 3rd variable combinations in this classifier, the final model can be interpreted as follows: from top to bottom (root to leaves), a judgement is applied at each level. Region is considered first: if the YouTuber is from Europe, the Middle East or North America, the case passes to the next level, last-month views or last-month subscribers. After these step-by-step judgements, each YouTuber receives a final score, shown at the bottom: a score close to 1 denotes a high probability of high income, while a score close to 0 denotes a high probability of low income. The chart also shows the proportion of observations in each branch.
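The tree diagrams discussed above were drawn by code that was not echoed. A minimal sketch of the plotting call, fitted on rpart's bundled kyphosis data rather than the report's models (rpart.plot is an assumption here; it is not loaded elsewhere in the report):

```r
library(rpart)
library(rpart.plot)  # assumed installed; any of t1model/t2model/t3model could be passed instead

# Fit a small demonstration tree on rpart's bundled kyphosis data
fit <- rpart(Kyphosis ~ Age + Number + Start, data = kyphosis)

# Each node shows the fitted probability and the share of observations in
# that branch, matching the scores and proportions read off the trees above
rpart.plot(fit, roundint = FALSE)
```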
XGBoost (Extreme Gradient Boosting) is a powerful classifier well known for its high performance in many competitions. It is designed to enhance the speed and performance of gradient-boosted trees. The classifier trained with this method will also be compared against the other two models.
The XGBoost process is divided into the following parts:
Training an XGBoost model is similar to other machine learning methods; however, it requires matrix input rather than a data frame. Therefore, the project converts all categorical data into numerical form for this process.
# convert the data type
convert_region_to_integer <- function(data) {
  data$region <- as.integer(as.factor(data$region))
  return(data)
}
# use the function above to convert region from character to integer
dTrain_int1 <- convert_region_to_integer(dTrain1)
dTest_int1 <- convert_region_to_integer(dTest1)
dCal_int1 <- convert_region_to_integer(dCal1)
dTrain_int2 <- convert_region_to_integer(dTrain2)
dTest_int2 <- convert_region_to_integer(dTest2)
dCal_int2 <- convert_region_to_integer(dCal2)
dTrain_int3 <- convert_region_to_integer(dTrain3)
dTest_int3 <- convert_region_to_integer(dTest3)
dCal_int3 <- convert_region_to_integer(dCal3)
#transform the data frame as matrix
matCal1 <- as.matrix(dCal_int1[, !names(dCal_int1) %in% c("income_class")])
matTrain1 <- as.matrix(dTrain_int1[, !names(dTrain_int1) %in% c("income_class")])
matTest1 <- as.matrix(dTest_int1[, !names(dTest_int1) %in% c("income_class")])
matCal2 <- as.matrix(dCal_int2[, !names(dCal_int2) %in% c("income_class")])
matTrain2 <- as.matrix(dTrain_int2[, !names(dTrain_int2) %in% c("income_class")])
matTest2 <- as.matrix(dTest_int2[, !names(dTest_int2) %in% c("income_class")])
matCal3 <- as.matrix(dCal_int3[, !names(dCal_int3) %in% c("income_class")])
matTrain3 <- as.matrix(dTrain_int3[, !names(dTrain_int3) %in% c("income_class")])
matTest3 <- as.matrix(dTest_int3[, !names(dTest_int3) %in% c("income_class")])
# define the xgboost wrapper function as xgbooost
xgbooost = function(variable_matrix, labelvec) {
cv <- xgb.cv(variable_matrix, label = labelvec,
params=list(
objective="binary:logistic"
),
nfold=5,
nrounds=100,
print_every_n=10,
metrics="logloss")
evalframe <- as.data.frame(cv$evaluation_log)
NROUNDS <- which.min(evalframe$test_logloss_mean)
model <- xgboost(data=variable_matrix, label=labelvec,
params=list(
objective="binary:logistic"
),
nrounds=NROUNDS,
verbose=FALSE)
model
}
AUC and performance metrics will be used to evaluate the 3 models here.
# define AUC function for XGboost, this function is introduced by GPT-3.5
calculate_auc <- function(model, mat_data, actual_labels, dataset_name) {
pred_prob <- predict(model, newdata = mat_data)
auc_val <- calcAUC(pred_prob, actual_labels)
cat("AUC for", dataset_name, ": ", auc_val, "\n")
return(auc_val)
}
# train the model
model1 <- xgbooost(matTrain1, dTrain1$income_class)
## [1] train-logloss:0.489003+0.005236 test-logloss:0.502004+0.012427
## [11] train-logloss:0.115534+0.008557 test-logloss:0.194942+0.043900
## [21] train-logloss:0.080425+0.006269 test-logloss:0.202123+0.056289
## [31] train-logloss:0.065575+0.005248 test-logloss:0.213726+0.060167
## [41] train-logloss:0.056217+0.004990 test-logloss:0.221290+0.063837
## [51] train-logloss:0.049456+0.004658 test-logloss:0.229414+0.066846
## [61] train-logloss:0.044319+0.004322 test-logloss:0.235177+0.068761
## [71] train-logloss:0.040240+0.003695 test-logloss:0.241670+0.069036
## [81] train-logloss:0.036913+0.003198 test-logloss:0.245261+0.070285
## [91] train-logloss:0.034355+0.003062 test-logloss:0.249905+0.071751
## [100] train-logloss:0.032423+0.002979 test-logloss:0.255216+0.072695
## [1] train-logloss:0.487866+0.005881 test-logloss:0.509624+0.010574
## [11] train-logloss:0.089470+0.006903 test-logloss:0.193553+0.053758
## [21] train-logloss:0.045562+0.004705 test-logloss:0.197029+0.077614
## [31] train-logloss:0.031181+0.003174 test-logloss:0.209479+0.087445
## [41] train-logloss:0.024554+0.002185 test-logloss:0.217866+0.093244
## [51] train-logloss:0.020905+0.001722 test-logloss:0.226500+0.098404
## [61] train-logloss:0.018480+0.001311 test-logloss:0.230095+0.097497
## [71] train-logloss:0.016667+0.001039 test-logloss:0.234147+0.099129
## [81] train-logloss:0.015431+0.000892 test-logloss:0.238664+0.100608
## [91] train-logloss:0.014444+0.000782 test-logloss:0.242839+0.103370
## [100] train-logloss:0.013725+0.000652 test-logloss:0.246519+0.105123
## [1] train-logloss:0.485594+0.005842 test-logloss:0.512097+0.007616
## [11] train-logloss:0.086530+0.008117 test-logloss:0.202574+0.047442
## [21] train-logloss:0.043699+0.004603 test-logloss:0.199733+0.059089
## [31] train-logloss:0.029559+0.002450 test-logloss:0.204980+0.061350
## [41] train-logloss:0.023200+0.001843 test-logloss:0.209779+0.064027
## [51] train-logloss:0.019607+0.001205 test-logloss:0.211676+0.065474
## [61] train-logloss:0.017257+0.001011 test-logloss:0.214933+0.068331
## [71] train-logloss:0.015658+0.000938 test-logloss:0.217746+0.070963
## [81] train-logloss:0.014419+0.000837 test-logloss:0.219346+0.072136
## [91] train-logloss:0.013477+0.000705 test-logloss:0.221318+0.072684
## [100] train-logloss:0.012817+0.000663 test-logloss:0.224385+0.074409
# calculate the AUC with the function defined above
auc_train1 <- calculate_auc(model1, matTrain1, dTrain1$income_class, "Training 1")
## AUC for Training 1 :  0.9958347
## AUC for Calibration 1 : 0.9975787
## AUC for Test 1 : 0.9645588
## AUC for Training 2 : 0.999449
## AUC for Calibration 2 : 0.9985876
## AUC for Test 2 : 0.9631023
## AUC for Training 3 : 0.9998898
## AUC for Calibration 3 : 0.9993947
## AUC for Test 3 : 0.9630416
From the AUC scores, the second model performs better on the calibration data. However, comparing AUC alone is not enough; a further analysis follows below to obtain more indicators.
# define the function for XGBoost performance
performanceMeasures <- function(ytrue, ypred_prob, model.name = "model", threshold=0.5) {
ypred <- ifelse(ypred_prob > threshold, 1, 0)
# compute the normalised deviance
dev.norm <- -2 * logLikelihood(ytrue, ypred_prob)/length(ypred_prob)
# compute the confusion matrix
cmat <- table(actual = ytrue, predicted = ypred)
accuracy <- sum(diag(cmat)) / sum(cmat)
precision <- cmat[2, 2] / sum(cmat[, 2])
recall <- cmat[2, 2] / sum(cmat[2, ])
f1 <- 2 * precision * recall / (precision + recall)
data.frame(model = model.name,
precision = precision, recall = recall, f1 = f1, dev.norm = dev.norm)
}
panderOpt <- function(){
panderOptions("plain.ascii", TRUE)
panderOptions("keep.trailing.zeros", TRUE)
panderOptions("table.style", "simple")
}
# pretty_perf_table
pretty_perf_table <- function(model, xtrain, ytrain, xtest, ytest, threshold=0.5) {
# Option setting for Pander
panderOpt()
perf_justify <- "lrrrr"
# call the predict() function to do the predictions
pred_train <- predict(model, newdata=xtrain)
pred_test <- predict(model, newdata=xtest)
# comparing performance on training vs. test
trainperf_df <- performanceMeasures(ytrain, pred_train, model.name="training", threshold=threshold)
testperf_df <- performanceMeasures(ytest, pred_test, model.name="test", threshold=threshold)
# combine the two performance data frames using rbind()
perftable <- rbind(trainperf_df, testperf_df)
pandoc.table(perftable, justify = perf_justify)
}
pretty_perf_table(model1, matTrain1, dTrain_int1$income_class, matTest1, dTest_int1$income_class)
##
##
## model precision recall f1 dev.norm
## ---------- ----------- -------- -------- ----------
## training 0.9699 0.9758 0.9728 0.2104
## test 0.8588 0.9481 0.9012 0.4316
##
##
## model precision recall f1 dev.norm
## ---------- ----------- -------- -------- ----------
## training 0.9880 0.9939 0.9909 0.1281
## test 0.9012 0.9481 0.9241 0.4315
##
##
## model precision recall f1 dev.norm
## ---------- ----------- -------- -------- ----------
## training 0.9880 0.9970 0.9925 0.1048
## test 0.8889 0.9351 0.9114 0.4360
From these performance metrics, the second model still has the best performance on the test data across all scores, so it is selected for further explanation.
LIME is an interpretation tool that explains a model's individual predictions with a local surrogate model.
explainer <- function(model, variable_matrix) {
explainer <- lime(as.data.frame(variable_matrix), model)
explanation <- explain(as.data.frame(variable_matrix)[1:4,], explainer, n_labels = 1, n_features = 5)
explanation
}
explanation2 <- explainer(model2, matTrain2)
print(explanation2)
## # A tibble: 20 × 13
## model_type case label label_prob model_r2 model_intercept model_prediction
## <chr> <chr> <chr> <dbl> <dbl> <dbl> <dbl>
## 1 classificat… 1 0 0.990 0.397 0.223 0.990
## 2 classificat… 1 0 0.990 0.397 0.223 0.990
## 3 classificat… 1 0 0.990 0.397 0.223 0.990
## 4 classificat… 1 0 0.990 0.397 0.223 0.990
## 5 classificat… 1 0 0.990 0.397 0.223 0.990
## 6 classificat… 2 1 0.992 0.265 0.458 0.962
## 7 classificat… 2 1 0.992 0.265 0.458 0.962
## 8 classificat… 2 1 0.992 0.265 0.458 0.962
## 9 classificat… 2 1 0.992 0.265 0.458 0.962
## 10 classificat… 2 1 0.992 0.265 0.458 0.962
## 11 classificat… 3 1 0.956 0.0304 0.582 0.956
## 12 classificat… 3 1 0.956 0.0304 0.582 0.956
## 13 classificat… 3 1 0.956 0.0304 0.582 0.956
## 14 classificat… 3 1 0.956 0.0304 0.582 0.956
## 15 classificat… 3 1 0.956 0.0304 0.582 0.956
## 16 classificat… 6 1 0.989 0.267 0.433 1.04
## 17 classificat… 6 1 0.989 0.267 0.433 1.04
## 18 classificat… 6 1 0.989 0.267 0.433 1.04
## 19 classificat… 6 1 0.989 0.267 0.433 1.04
## 20 classificat… 6 1 0.989 0.267 0.433 1.04
## # ℹ 6 more variables: feature <chr>, feature_value <dbl>, feature_weight <dbl>,
## # feature_desc <chr>, data <list>, prediction <list>
# use the xg.importance to show the distribution of feature weight
importances <- xgb.importance(model = model2)
xgb.plot.importance(importances)
In the first table, the most valuable column is “feature_weight”: it stands for the importance of a feature to the final prediction of the target variable. The middle chart shows the 4 case observations from LIME; for each observation, coloured bars show whether a feature has a positive or negative impact on the target variable, and the extent of that impact. The third chart shows the average impact of each feature on the target variable.
Therefore, based on a general reading of the three charts above, “last_month_view” and “region” have the highest feature-weight values, which means both play important roles in this model. Between them, “last_month_view” has a slightly higher impact on the target variable than “region”. Apart from these two, all other variables show a limited impact on the final target variable.
Unfortunately, as this classifier is a black-box model, its internal process cannot be explained as clearly as the interpretation above.
The Logistic Regression model is a machine learning method that estimates the probability of a certain class. It fits an S-shaped (sigmoid) curve and computes the probability from the resulting formula.
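As a quick base-R illustration of that S-shaped curve (the score z stands for the linear combination of features; no fitted coefficients from the report are used here):

```r
# Logistic (sigmoid) function: maps any real-valued linear score z to (0, 1)
sigmoid <- function(z) 1 / (1 + exp(-z))

z <- seq(-6, 6, by = 0.1)
plot(z, sigmoid(z), type = "l",
     xlab = "linear score z", ylab = "estimated P(income_class = 1)",
     main = "Logistic (sigmoid) curve")
abline(h = 0.5, v = 0, lty = 2)  # z = 0 maps to p = 0.5, the usual decision boundary
```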
The Logistic Regression modelling process is divided into the following parts:
#refresh the data
dTrain1 <- dTrain %>%
select(all_of(c(selVars, outcome)))
dTrain2 <- dTrain %>%
select(all_of(c(selVar2, outcome)))
dTrain3 <- dTrain %>%
select(all_of(c(selVar3, outcome)))
dTest1 <- dTest %>%
select(all_of(c(selVars, outcome)))
dTest2 <- dTest %>%
select(all_of(c(selVar2, outcome)))
dTest3 <- dTest %>%
select(all_of(c(selVar3, outcome)))
dCal1 <- dCal %>%
select(all_of(c(selVars, outcome)))
dCal2 <- dCal %>%
select(all_of(c(selVar2, outcome)))
dCal3 <- dCal %>%
select(all_of(c(selVar3, outcome)))
## income_class~ region + last_month_subscribe + last_month_view
## Feature dimension = 3
gmodel1 <- glm(as.formula(f), data=dTrain1, family=binomial(link='logit'))
print(calcAUC(predict(gmodel1, newdata=dTrain1), dTrain1[,outcome]))
## [1] 0.978259
## [1] 0.9531497
## [1] 0.9935432
## income_class~ region + views + uploads + last_month_view + last_month_subscribe + subscribers_for_last_30_days_isBAD + numeric_date
## Feature dimension = 7
As above, both AUC and performance metrics will be used to evaluate the 3 variable combinations under the logistic regression method.
## [1] 0.9785234
## [1] 0.9539993
## [1] 0.9961663
## income_class~ subscribers + views + uploads + last_month_view + last_month_subscribe + subscribers_for_last_30_days_isBAD + numeric_date + region
## Feature dimension = 8
## Warning: glm.fit: fitted probabilities numerically 0 or 1 occurred
## [1] 0.9787879
## [1] 0.9547275
## [1] 0.9953592
The modelling process is almost the same as for the decision tree, but the AUC values almost reach 1. From the AUC scores above, the second logistic regression model is better, based on its result on the test data. A further analysis with other evaluation indicators is carried out later.
#ROC plot
pred_gmodel3_roc <- predict(gmodel3, newdata=dTest3)
pred_gmodel2_roc <- predict(gmodel2, newdata=dTest2)
pred_gmodel1_roc <- predict(gmodel1, newdata=dTest1)
plot_roc(pred_gmodel1_roc, dTest1[[outcome]],
pred_gmodel2_roc, dTest2[[outcome]],
pred_gmodel3_roc, dTest3[[outcome]])
The plot shows how these 3 models perform. However, the plot alone is not enough to evaluate them: the lines overlap so closely that their edges blur and the differences are hard to see, so performance metrics are needed to compare the models.
# performance metrics
pretty_perf_table(gmodel1, dTrain1[selVars], dTrain1[,outcome]==pos, dTest1[selVars], dTest1[,outcome]==pos)
##
##
## model precision recall f1 dev.norm
## ---------- ----------- -------- -------- ----------
## training 0.9613 0.9030 0.9313 -2.752
## test 0.9242 0.7922 0.8531 -2.444
pretty_perf_table(gmodel2, dTrain2[selVar2], dTrain2[,outcome]==pos, dTest2[selVar2], dTest2[,outcome]==pos)
##
##
## model precision recall f1 dev.norm
## ---------- ----------- -------- -------- ----------
## training 0.9643 0.9000 0.9310 -2.801
## test 0.9254 0.8052 0.8611 -2.548
pretty_perf_table(gmodel3, dTrain3[selVar3], dTrain3[,outcome]==pos, dTest3[selVar3], dTest3[,outcome]==pos)
##
##
## model precision recall f1 dev.norm
## ---------- ----------- -------- -------- ----------
## training 0.9613 0.9030 0.9313 -2.809
## test 0.9242 0.7922 0.8531 -2.496
From the statistics above, the second model is still the best, so it is selected for the following explanation.
Because LIME does not support the glm (logit) function directly, an alternative way of training the logistic regression is required for the interpretation. In this case, the caret package is used for the model training.
#caret glm (instead of original one)
# Function to calculate AUC using ROCR
calcAUC_ROCR <- function(preds, truth) {
pred_obj <- prediction(preds, truth)
perf_obj <- performance(pred_obj, measure = "auc")
return(perf_obj@y.values[[1]])
}
# Model 1
f1 <- as.formula(paste(outcome,'~ ', paste(selVars, collapse=' + ')))
cat(deparse(f1))
## income_class ~ region + last_month_subscribe + last_month_view
## Feature dimension = 3
gmodel1_caret <- train(f1, data=dTrain1, method="glm", family=binomial(link='logit'))
print(calcAUC_ROCR(predict(gmodel1_caret, dTrain1), dTrain1[,outcome]))
## [1] 0.978259
## [1] 0.9531497
## [1] 0.9935432
## income_class ~ region + views + uploads + last_month_view + last_month_subscribe + subscribers_for_last_30_days_isBAD + numeric_date
## Feature dimension = 7
gmodel2_caret <- train(f2, data=dTrain2, method="glm", family=binomial(link='logit'))
print(calcAUC_ROCR(predict(gmodel2_caret, dTrain2), dTrain2[,outcome]))
## [1] 0.9785234
## [1] 0.9539993
## [1] 0.9961663
## income_class ~ subscribers + views + uploads + last_month_view + last_month_subscribe + subscribers_for_last_30_days_isBAD + numeric_date + region
## Feature dimension = 8
gmodel3_caret <- train(f3, data=dTrain3, method="glm", family=binomial(link='logit'))
print(calcAUC_ROCR(predict(gmodel3_caret, dTrain3), dTrain3[,outcome]))
## [1] 0.9787879
## [1] 0.9547275
## [1] 0.9953592
Although different tools are used for the logistic regression, their AUC scores are identical, which confirms that these 3 caret models are the same as the 3 trained before.
For the model interpretation, LIME is applied to model 2 to show how this works.
#LIME conduct
explainer2 <- lime(dTrain2, gmodel2_caret, bin_continuous = TRUE)
observation2 <- dTest2[1:4, !(names(dTest2) %in% c("income_class"))]
explanation2 <- explain(observation2, explainer2, n_features = 7)
print(explanation2)
## # A tibble: 28 × 11
## model_type case model_r2 model_intercept model_prediction feature
## <chr> <chr> <dbl> <dbl> <dbl> <chr>
## 1 regression 5 0.108 0.534 0.650 region
## 2 regression 5 0.108 0.534 0.650 views
## 3 regression 5 0.108 0.534 0.650 uploads
## 4 regression 5 0.108 0.534 0.650 last_month_view
## 5 regression 5 0.108 0.534 0.650 last_month_subscr…
## 6 regression 5 0.108 0.534 0.650 subscribers_for_l…
## 7 regression 5 0.108 0.534 0.650 numeric_date
## 8 regression 15 0.0483 0.457 0.953 region
## 9 regression 15 0.0483 0.457 0.953 views
## 10 regression 15 0.0483 0.457 0.953 uploads
## # ℹ 18 more rows
## # ℹ 5 more variables: feature_value <chr>, feature_weight <dbl>,
## # feature_desc <chr>, data <list>, prediction <dbl>
From the LIME explainer, similar to the previous LIME results, "region" and "last_month_view" have the largest absolute feature weights, which means they are the most important features in the classification model. For instance, last_month_view, views and numeric_date have negative feature weights in the first observation, which means they push that prediction away from the positive class.
From the variable importance plot, the "region" feature was automatically divided into several 0/1 dummy features for processing. Even divided into many variables, the plot still indicates that "region" and "last_month_view" are the most important features in the model, as the top bars in the chart are all related to region or last month view. One point to emphasize is that "region North America" shows the greatest importance in the chart, which suggests that the North America region may have a vital impact on the entire model.
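To summarize which features dominate across the explained cases, the feature weights in the explain() output can be aggregated by mean absolute value. The sketch below runs on a toy data frame with the same feature/feature_weight columns as explanation2; the weights are made-up illustrative values, not the report's results:

```r
# Toy stand-in for the lime explain() output: only the two columns needed here.
# The weights below are invented for illustration.
expl <- data.frame(
  feature = c("region", "last_month_view", "views",
              "region", "last_month_view", "views"),
  feature_weight = c(0.40, -0.35, -0.05, 0.30, -0.45, 0.02)
)
# Rank features by mean absolute local weight (larger = more influential)
ranking <- sort(tapply(abs(expl$feature_weight), expl$feature, mean),
                decreasing = TRUE)
ranking
```

On the real explanation2 tibble the same aggregation would be expected to rank region and last_month_view near the top.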
After the process above, 3 models have been selected from the 9, each pairing a classifier with its best-performing variable combination. In this phase, 1 of the 3 models will be selected on the basis of final performance. The performance will be evaluated in two sections:
#Performance Matrix and visualization edited by GPT-3.5
pretty_perf_table <- function(model, xtrain, ytrain, xtest, ytest, threshold=0.5) {
# Option setting for Pander
panderOpt()
perf_justify <- "lrrrr"
# call the predict() function to do the predictions
pred_train <- predict(model, newdata=xtrain)
pred_test <- predict(model, newdata=xtest)
# comparing performance on training vs. test
trainperf_df <- performanceMeasures( ytrain, pred_train, model.name="training", threshold=threshold)
testperf_df <- performanceMeasures( ytest, pred_test, model.name="test", threshold=threshold)
# combine the two performance data frames using rbind()
perftable <- rbind(trainperf_df, testperf_df)
return(perftable)
}
#LOGISTIC
perf1 <- pretty_perf_table(gmodel3, dTrain3[selVar3], dTrain3[,outcome]==pos, dTest3[selVar3], dTest3[,outcome]==pos)
#DECISION TREE
perf2 <- pretty_perf_table(t2model, dTrain2[selVar2], dTrain2[,outcome]==pos, dTest2[selVar2], dTest2[,outcome]==pos)
#XGBOOST
perf3 <- pretty_perf_table(model2, matTrain2, dTrain2$income_class, matTest2, dTest2$income_class)
perf1$model_type <- "Logistic Regression #3"
perf2$model_type <- "Decision Tree #2"
perf3$model_type <- "XGBoost #2"
perf1_test <- subset(perf1, model == "test")
perf2_test <- subset(perf2, model == "test")
perf3_test <- subset(perf3, model == "test")
all_perf_test <- rbind(perf1_test, perf2_test, perf3_test)
all_perf_long <- gather(all_perf_test, metric, value, -model_type, -model)
ggplot(all_perf_long, aes(x=model_type, y=value, color=model_type)) +
geom_point(size=4, position=position_dodge(0.6)) +
facet_wrap(~metric, scales="free_y", ncol=1) +
labs(title="Test Data Performance", y="Value") +
theme_minimal() +
theme(legend.position="right")
This point plot is designed to check the performance of each model using the earlier evaluations on the test data, and it clearly shows which model performs better on each criterion.
For the f1 score, XGBoost #2 has the highest score, followed by Logistic Regression #3 and then Decision Tree #2.
In terms of precision, Logistic Regression #3 leads, followed closely by XGBoost #2, with Decision Tree #2 lagging slightly behind.
When looking at recall, Decision Tree #2 and XGBoost #2 have comparable scores, both being higher than that of Logistic Regression #3.
For dev.norm, Decision Tree #2 and XGBoost #2 perform better than Logistic Regression #3.
From the plot above, the XGBoost #2 model clearly has the best overall performance.
Next, move on to the ROC evaluation.
#ROC plot improved by GPT-3.5
roc_decision_tree <- roc(dTest2[,outcome], predict(t2model, newdata=dTest2))
roc_xgboost2 <- roc(dTest2$income_class, predict(model2, newdata=matTest2))
#use predicted class probabilities (not hard labels) for the logistic ROC
roc_logistic_regression <- roc(dTest3[,outcome], predict(gmodel3_caret, dTest3, type="prob")[,2])
ggroc(list(Decision_Tree_2=roc_decision_tree, XGBoost_2=roc_xgboost2, Logistic_Regression_3=roc_logistic_regression))
The ROC curves also show that XGBoost #2 is the best model of all, as its curve lies closer to the top-left corner than the others.
According to the evaluation of the best model from each classifier, the final best model out of the 9 is the XGBoost classifier with the second variable combination, selected by the Boruta algorithm. The 7 selected variables are: "region", "views", "uploads", "last_month_view", "last_month_subscribe", "subscribers_for_last_30_days_isBAD" and "numeric_date". The interpretation has been shown in 2.3.2.3.
Clustering is an unsupervised method: no target variable is required, and the algorithm groups all the observations into several clusters while keeping a good score on a specific assessment metric.
The clustering part is divided into the following steps:
#set a new data set for clustering, extract the valid variables from the processed df
df.cluster <- df[,c(catvars,numvars)]
#group the variables into numeric and categorical based on type
cluster.catvars <- colnames(df.cluster)[sapply(df.cluster[, colnames(df.cluster)], class) %in%
c('factor', 'character')]
cluster.numvars <- colnames(df.cluster)[sapply(df.cluster[, colnames(df.cluster)], class) %in%
c('numeric', 'integer')]
#check the variables
cluster.catvars
## [1] "region"   "category"
cluster.numvars
## [1] "subscribers" "views"
## [3] "uploads" "last_month_view"
## [5] "last_month_subscribe" "subscribers_for_last_30_days_isBAD"
## [7] "numeric_date"
For the clustering data frame, there are 2 categorical variables and 7 numeric variables. In the previous steps, the data cleaning and transformation dropped some meaningless and redundant variables, deleted or imputed noisy and missing values, and reduced the levels of the categorical variables to keep the model tidy. These steps greatly help the further process.
However, as the clustering methods used place some limitations on the variable format, further processing is applied:
#convert category variables to dummy
dummy_vars <- model.matrix(~ . - 1, data = df.cluster[, cluster.catvars])
df.cluster <- df.cluster[, -which(names(df.cluster) %in% cluster.catvars)]
df.cluster <- cbind(df.cluster, dummy_vars)
colnames(df.cluster) <- gsub("(.*)\\((.*)\\)", "\\2_\\1", colnames(df.cluster))
#delete all cate variables
df.cluster$cluster.catvars <- NULL
#Scale all of the numeric columns in df.cluster
vars.to.use <- colnames(df.cluster)
scaled_df <- scale(df.cluster[,vars.to.use])
#check the centre (mean) attribute of the scaled data
attr(scaled_df, "scaled:center")
## subscribers views
## 2.275311e+07 1.133589e+10
## uploads last_month_view
## 9.806515e+03 1.599906e+08
## last_month_subscribe subscribers_for_last_30_days_isBAD
## 3.582064e+05 3.143777e-01
## numeric_date regionAsia Pacific
## 1.576098e+04 9.656652e-02
## regionEurope regionMiddle East
## 6.652361e-02 1.824034e-02
## regionNorth America regionOthers
## 3.369099e-01 1.695279e-01
## regionSouth America regionSouth Asia
## 1.266094e-01 1.856223e-01
## categoryEducation categoryEntertainment
## 6.115880e-02 3.251073e-01
## categoryFilm & Animation categoryGaming
## 4.828326e-02 9.549356e-02
## categoryHowto & Style categoryMissing
## 4.077253e-02 3.969957e-02
## categoryMusic categoryNews & Politics
## 2.060086e-01 2.789700e-02
## categoryNonprofits & Activism categoryPeople & Blogs
## 2.145923e-03 1.351931e-01
## categoryPets & Animals categorySports
## 4.291845e-03 1.180258e-02
#check the scale (standard deviation) attribute
attr(scaled_df, "scaled:scale")
## subscribers views
## 1.681243e+07 1.446043e+10
## uploads last_month_view
## 3.520186e+04 2.943851e+08
## last_month_subscribe subscribers_for_last_30_days_isBAD
## 5.139967e+05 4.645168e-01
## numeric_date regionAsia Pacific
## 1.637390e+03 2.955252e-01
## regionEurope regionMiddle East
## 2.493289e-01 1.338913e-01
## regionNorth America regionOthers
## 4.729076e-01 3.754190e-01
## regionSouth America regionSouth Asia
## 3.327135e-01 3.890103e-01
## categoryEducation categoryEntertainment
## 2.397500e-01 4.686664e-01
## categoryFilm & Animation categoryGaming
## 2.144792e-01 2.940533e-01
## categoryHowto & Style categoryMissing
## 1.978690e-01 1.953573e-01
## categoryMusic categoryNews & Politics
## 4.046539e-01 1.647661e-01
## categoryNonprofits & Activism categoryPeople & Blogs
## 4.629922e-02 3.421133e-01
## categoryPets & Animals categorySports
## 6.540654e-02 1.080546e-01
The first clustering method is hierarchical clustering. In this project an agglomerative (bottom-up) approach is used, as implemented by hclust(): it starts with each data point in its own cluster and successively merges clusters until only one remains.
The linkage criterion is Ward's method (ward.D2), which at each merge minimizes the increase in within-cluster variance; distances between data points are Euclidean distances.
To keep the distances within each group as small as possible, an indicator assessing the clustering is computed for K = 1 to K = n; the K that optimizes the indicator is the number of groups at which hierarchical clustering performs best.
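As a minimal sketch of these mechanics (toy points, not the report's data set), Ward clustering over Euclidean distances can be built with hclust() and cut at a chosen K with cutree():

```r
# Toy example: four 2-D points forming two obvious groups
m <- matrix(c(0, 0,
              0, 1,
              10, 10,
              10, 11), ncol = 2, byrow = TRUE)
d <- dist(m, method = "euclidean")     # pairwise Euclidean distances
fit <- hclust(d, method = "ward.D2")   # Ward's linkage
cutree(fit, k = 2)                     # one cluster label per point: 1 1 2 2
```

cutree() is the same call used later in the report to cut the full dendrogram at the chosen K.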
For the assessment, two different indicators are calculated: the total within-cluster sum of squares (WSS) and the Calinski-Harabasz (CH) index.
Both assessment methods are tried in this project.
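For reference, the CH index follows the standard Calinski-Harabasz definition, where WSS(k) is the total within-cluster sum of squares at k clusters, BSS(k) = TSS − WSS(k) is the between-cluster sum of squares, and n is the number of points (a sketch of the formula, matching the B/W ratio computed in the code):

```latex
\mathrm{CH}(k) = \frac{\mathrm{BSS}(k)/(k-1)}{\mathrm{WSS}(k)/(n-k)}
```

A larger CH(k) indicates tighter, better-separated clusters.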
#define the squared Euclidean distance helper, then WSS, WSS total, TSS
sqr_euDist <- function(x, y) {
sum((x - y)^2)
}
wss <- function(clustermat) {
c0 <- colMeans(clustermat)
sum(apply(clustermat, 1, FUN=function(row) {sqr_euDist(row, c0)}))
}
wss_total <- function(scaled_df, labels) {
wss.sum <- 0
k <- length(unique(labels))
for (i in 1:k)
wss.sum <- wss.sum + wss(subset(scaled_df, labels == i))
wss.sum
}
tss <- function(scaled_df) {
wss(scaled_df)
}
#define function of CH index
CH_index <- function(scaled_df, kmax, method = "kmeans") {
if (!(method %in% c("kmeans", "hclust")))
stop("method must be one of c('kmeans', 'hclust')")
npts <- nrow(scaled_df)
wss.value <- numeric(kmax)
wss.value[1] <- wss(scaled_df)
if (method == "kmeans") {
for (k in 2:kmax) {
clustering <- kmeans(scaled_df, k, nstart = 10, iter.max = 100)
wss.value[k] <- clustering$tot.withinss
}
} else {
d <- dist(scaled_df, method = "euclidean")
pfit <- hclust(d, method = "ward.D2")
for (k in 2:kmax) {
labels <- cutree(pfit, k = k)
wss.value[k] <- wss_total(scaled_df, labels)
}
}
bss.value <- tss(scaled_df) - wss.value
B <- bss.value / (0:(kmax-1))
W <- wss.value / (npts - 1:kmax)
data.frame(k = 1:kmax, CH_index = B/W, WSS = wss.value)
}
# calculate and plot total WSS and CH index
crit.df <- CH_index(scaled_df, 50, method="hclust")
#plot the total WSS
fig1 <- ggplot(crit.df, aes(x=k, y=WSS), color="blue") + geom_point() + geom_line(colour="blue") + scale_x_continuous(breaks=1:50, labels=1:50) + theme(text=element_text(size=10))+labs(title = "Total WSS Plot")
#plot the CH index
fig2 <- ggplot(crit.df, aes(x=k, y=CH_index)) + geom_point() + geom_line(colour="red") + scale_x_continuous(breaks=1:50, labels=1:50) + labs(y="CH index") + theme(text=element_text(size=10))+ labs(title = "CH index Plot")
fig1
fig2
From the total WSS plot, there is no significant elbow shape in the chart. However, in the CH index chart an elbow occurs at K=21, indicating that the clustering performs best when the data points are grouped into 21 clusters.
Here the clusters are plotted with K=21 to view them directly:
#set the k as k=21 (recompute the dendrogram here, as pfit above was local to CH_index)
d <- dist(scaled_df, method="euclidean")
pfit <- hclust(d, method="ward.D2")
groups <- cutree(pfit, k=21)
print_clusters <- function(df, groups, cols_to_print=colnames(df)) {
Ngroups <- max(groups)
for (i in 1:Ngroups) {
print(paste("cluster", i))
print(df[groups == i, cols_to_print])
}
}
#check the clusters of the 21 groups (too long to show in the html document)
print_clusters(df.cluster, groups)
#to visualise the clusters, reduce the variables to 2 dimensions (principal components)
princ <- prcomp(scaled_df) # Calculate the principal components of scaled_df
nComp <- 2
project2D <- as.data.frame(predict(princ, newdata=scaled_df)[,1:nComp])
hclust.project2D <- cbind(project2D, cluster=as.factor(groups))
# finding the convex hull of each cluster
find_convex_hull <- function(proj2Ddf, groups) {
do.call(rbind,
lapply(unique(groups),
FUN = function(c) {
f <- subset(proj2Ddf, cluster==c);
f[chull(f),]
}
)
)
}
hclust.hull <- find_convex_hull(hclust.project2D, groups)
#plot the clustering with k=21
ggplot(hclust.project2D, aes(x=PC1, y=PC2)) +
geom_point(aes(shape=cluster, color=cluster)) +
geom_text(aes(label= '', color=cluster), hjust=0, vjust=1, size=3) +
geom_polygon(data=hclust.hull, aes(group=cluster, fill=as.factor(cluster)),
alpha=0.4, linetype=0) + theme(text=element_text(size=10))+
labs(title = "H-Clustering Plot (K=21)")
Although K=21 is the best-scoring value on the index, the plotted clusters show a mess of huge overlaps among the groups, which may be traced to the low number of observations.
Therefore, although H-clustering delivers an output with K=21, the performance is still far from expectations and it is hard to extract valuable information from it. Another clustering method is therefore conducted in the project.
K-means clustering is another method. In the process, a hyper-parameter k is first selected as the number of cluster centres, and all data points are assigned to their nearest centre. After that, the actual centre (centroid) of each group is recalculated, and all data points are reassigned to the new nearest centre.
This process is repeated until the centres stay stable. In practice, up to 100 iterations and 100 random starts are used to minimize the contingency brought by the randomly chosen initial centres.
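The assign-and-update loop described above can be sketched on toy one-dimensional data (the values and k=2 here are illustrative assumptions, not the report's data):

```r
# One Lloyd iteration of K-means on toy 1-D data
x <- c(1, 2, 10, 11)        # toy observations
centres <- c(1, 10)         # current cluster centres (k = 2)
# Assignment step: each point joins its nearest centre
assigned <- sapply(x, function(p) which.min(abs(p - centres)))
# Update step: each centre moves to the mean of its assigned points
centres <- sapply(seq_along(centres), function(j) mean(x[assigned == j]))
centres                     # after one iteration: 1.5 and 10.5
```

kmeans() repeats exactly this pair of steps until the centres stop moving; nstart=100 reruns the whole procedure from 100 random initialisations and keeps the best result.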
#set the number of clusters as 3, with up to 100 iterations and 100 random starts
kbest.p <- 3
kmClusters <- kmeans(df.kmean, kbest.p, nstart=100, iter.max=100)
#check the cluster centres
kmClusters$centers
## subscribers views uploads last_month_view last_month_subscribe
## 1 35985792 24329423118 16089.279 349657235 474751.7
## 2 125822222 116941533177 45595.889 1277423222 850911.8
## 3 18227162 6838227183 7817.528 99496197 323392.7
## subscribers_for_last_30_days_isBAD numeric_date
## 1 0.1912568 15394.34
## 2 0.1111111 14616.56
## 3 0.3472973 15865.57
#check the cluster sizes
kmClusters$size
## [1] 183   9 740
Similarly, there are two assessment methods to evaluate the clustering:
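For reference, the second criterion, average silhouette width (ASW), averages each point's silhouette value, where a(i) is the mean distance from point i to the other points in its own cluster and b(i) is the smallest mean distance from i to the points of any other cluster (a sketch of the standard definition):

```latex
s(i) = \frac{b(i) - a(i)}{\max\{a(i),\, b(i)\}}, \qquad
\mathrm{ASW} = \frac{1}{n}\sum_{i=1}^{n} s(i)
```

ASW lies in [−1, 1]; values near 1 mean points sit well inside their own clusters.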
#run to pick the best K by CH index
kmClustering.ch <- kmeansruns(df.kmean, krange=1:50, criterion="ch")
kmClustering.ch$bestk
## [1] 26
#run to pick the best K by average silhouette width (ASW)
kmClustering.asw <- kmeansruns(df.kmean, krange=1:50, criterion="asw")
kmClustering.asw$bestk
## [1] 2
#draw plots for two methods
kmCritframe <- data.frame(k=1:50, ch=kmClustering.ch$crit,
asw=kmClustering.asw$crit)
fig1 <- ggplot(kmCritframe, aes(x=k, y=ch)) +
geom_point() +
geom_line(colour="red") +
scale_x_continuous(breaks=1:50, labels=1:50) +
labs(y="CH index", title="CH index Plot") + theme(text=element_text(size=10))
fig2 <- ggplot(kmCritframe, aes(x=k, y=asw)) + geom_point() +
geom_line(colour="blue") +
scale_x_continuous(breaks=1:50, labels=1:50) +
labs(y="ASW", title="ASW Plot") +
theme(text=element_text(size=10))
fig1
fig2
After the process, the two evaluation methods suggest three candidate K values for K-means clustering.
For the CH index, there is a local peak at K=10 and a global peak at K=16.
For ASW, there is a notable global peak at K=2.
Three charts are plotted to visualize the clustering with the three different K values (2, 10, and 16).
fig <- c()
kvalues <- c(2,10,16)
for (k in kvalues) {
groups <- kmeans(scaled_df, k, nstart=100, iter.max=100)$cluster
kmclust.project2D <- cbind(project2D, cluster=as.factor(groups))
kmclust.hull <- find_convex_hull(kmclust.project2D, groups)
assign(paste0("fig", k),
ggplot(kmclust.project2D, aes(x=PC1, y=PC2)) + geom_point(aes(shape=cluster, color=cluster)) + geom_polygon(data=kmclust.hull, aes(group=cluster, fill=cluster),
alpha=0.4, linetype=0) + labs(title = sprintf("K means Clustering: k = %d", k)) +
theme(legend.position="none", text=element_text(size=10))
)
}
kbest.p <- 2
kmClusters <- kmeans(df.kmean, kbest.p, nstart=100, iter.max=100)
groups <- kmClusters$cluster
#check the different clusters (hard to show in HTML, so only the code is shown)
print_clusters(df.kmean, groups)
As shown in the cluster maps for the three different k values, both k=10 and k=16 produce messy, heavily overlapping clusterings. Compared with these, the chart with k=2 shows clearer clusters, which is relatively acceptable. However, a large overlapping area remains, so it can still be considered a low-quality clustering.
In summary, two fundamental methods were used to cluster the data set. Although the search for the best k value went smoothly, when the clusters are plotted in two dimensions neither H-clustering nor K-means clustering reaches an outstanding result: the clusters all overlap to some extent. This indicates low similarity inside the clusters and small differences between them, so clustering is not a good method to apply here.
Several causes can be suggested: the limited number of observations, high noise in the variables, or insufficiently differentiated features. Therefore, compared with clustering, classification is a more advisable method for exploring the regularities in this case.
In this project, several model generation and evaluation methods have been practiced.
In the data preparation phase, the original data was cleaned and transformed to make sure it could be applied effectively in the classification and clustering. One target variable was also decided on and processed for the classification prediction.
In the classification phase, the data was further processed (split) in order to train the models. 3 classifiers and 3 variable combinations were put together as 9 models. After several rounds of training and evaluation, a final best model was selected: the XGBoost classifier with the 7 features selected by the Boruta algorithm. This model performs relatively well across different dimensions, on both the ROC and the performance matrix. The questions raised at the beginning of the classification section are answered properly.
In the clustering phase, an unsupervised process, 2 methods were used for clustering. Between them, K-means clustering suggests a reasonable cluster number k (which is 2). However, based on the two-dimensional plots of both methods, neither performs well, as the actual clusters overlap heavily, which may indicate that the data points lack separable features. This may be due to limitations in the quality of the data or in the clustering methods.
Overall, the report analyses the YouTuber data set in much more depth than Project 1; the classification models generated and the clustering methods practiced give the audience new insight into the data set.
The last part is the Shiny app illustration. Here are the YouTube video link and the code:
https://www.youtube.com/watch?v=E-a7WDmpd4Q
library(shiny)
library(ggplot2)
library(dplyr)
library(ggthemes)
library(ROCit)
library(pROC)
ui <- navbarPage(
title = "CITS4009 Project 2",
h5("Mingyu LIAN: 24046428 & Chensu YANG: 24035732"),
tabPanel(
"Single Variable Model Evaluation",
h3("ROC Plot for Single Variables", style = "color:Darkred"),
p("The ROC plot shows the performance of each single-variable model. There are 9 models in total to be compared, each with its corresponding single variable. The closer the curve is to the top-left corner, the better the model."),
p("There is also a red dotted line showing the null model performance, which can be treated as the worst model. Normally, each single-variable model should perform better than the null model, i.e. its curve should stay above the red dotted line.
"),
h5("Please use the multi-select box to choose the variables you want to plot.",style = "color:grey"),
sidebarLayout(
sidebarPanel(
checkboxGroupInput("selected_columns", "Choose Variable to Plot:",
choices = c("region", "lst mth view", "uploads", "lst mth subscribe","subscribers","views","Vtreat Missing","numeric date","category"),
selected = c("region","lst mth view", "uploads", "lst mth subscribe")) ,
p('__________________________________'),
h4("Conclusion"),
p("From the observations, 4 variables: region, last month view, uploads and last month subscribe have notably better performance than the other variables."),
tags$a(href="https://www.youtube.com/watch?v=E-a7WDmpd4Q", "Click Here to Watch the Guide Video!")),
mainPanel(
plotOutput("ROC")
)
)
),
tabPanel(
"Multi-Variable Models Evaluation",
h3("Model Indicators Comparison", style = "color:Darkred"),
p("This comparison chart compares the performance of the 3 classifiers, each with its best variable combination, on the test set. They are: Decision Tree Classifier with Variable Combination 2,
Logistic Regression Classifier with Variable Combination 3, and XGBoost Classifier with Variable Combination 2. Besides, there are 4 commonly used indicators in the charts:
dev.norm, f1, precision and recall.
The higher the indicator value, the better the model."),
p("You can compare the performance of the 3 models by choosing the indicator you want to compare."),
h5("Please use the Radio Point to choose which indicator you would like to compare",style = "color:grey"),
sidebarLayout(
sidebarPanel(
radioButtons("indicator", "Choose an Indicator:",
choices = c("dev.norm", "f1", "precision", "recall"),
selected = c("dev.norm")) ,
p('__________________________________'),
h4("Conclusion"),
p("From the observations, the third model (XGBoost Classifier with Variable Combination 2) has a relatively good performance, especially on the f1 indicator.
Therefore, we choose this set as the final best model."),
tags$a(href="https://www.youtube.com/watch?v=E-a7WDmpd4Q", "Click Here to Watch the Guide Video!")),
mainPanel(
plotOutput("perfPlot")
)
)
),
tabPanel(
"K-Means Clustering",
h3("Clustering Plot (Two-Dimensional Display)", style = "color:Darkred"),
p("This clustering plot shows the result of K-Means Clustering. The plot is a two-dimensional display of the data.
The x-axis is the first principal component, and the y-axis is the second principal component.
The color of the points represents the cluster that the data point belongs to. The number of clusters is determined by the user. "),
h5("Please use the slider choose the number of clusters you want to plot.",style = "color:grey"),
sidebarLayout(
sidebarPanel(
sliderInput("selectn", label="Choose k as clusters number:",
min=2, max=10,value=2) ,
p('__________________________________'),
h4("Conclusion"),
p("From the evaluation methods mentioned in the report,
the clustering performs best when k=2. Checking k=2
shows relatively clearer clusters than other values of k. However, there is still a huge overlap between the two clusters,
which indicates it might not be a good clustering. The reason might be traced to the limited number of observations. Therefore, clustering might not be a good method for this dataset."),
tags$a(href="https://www.youtube.com/watch?v=E-a7WDmpd4Q", "Click Here to Watch the Guide Video!")),
mainPanel(
plotOutput("cluster")
)
)
)
)
## Warning: Navigation containers expect a collection of
## `bslib::nav_panel()`/`shiny::tabPanel()`s and/or
## `bslib::nav_menu()`/`shiny::navbarMenu()`s. Consider using `header` or `footer`
## if you wish to place content above (or below) every panel's contents.
server <- function(input, output) {
output$ROC <- renderPlot({
plot_roc <- function(predcol, outcol, colour_id=10, overlaid=F){
ROCit_obj <- rocit(score=predcol, class=outcol==pos)
par(new=overlaid)
plot(ROCit_obj,
col = c(colour_id, 10), legend = FALSE, YIndex = FALSE, values = FALSE)
}
if ("region" %in% input$selected_columns)
plot_roc(shiny1$predregion, shiny1[,outcome],colour_id=1)
if ("lst mth view" %in% input$selected_columns)
plot_roc(shiny1$predlast_month_view, shiny1[,outcome], colour_id=2, overlaid=T)
if ("uploads" %in% input$selected_columns)
plot_roc(shiny1$preduploads, shiny1[,outcome],colour_id=3,overlaid=T)
if ("lst mth subscribe" %in% input$selected_columns)
plot_roc(shiny1$predlast_month_subscribe, shiny1[,outcome], colour_id=4, overlaid=T)
if ("subscribers" %in% input$selected_columns)
plot_roc(shiny1$predsubscribers, shiny1[,outcome], colour_id=5, overlaid=T)
if ("views" %in% input$selected_columns)
plot_roc(shiny1$predviews, shiny1[,outcome], colour_id=6, overlaid=T)
if ("Vtreat Missing" %in% input$selected_columns)
plot_roc(shiny1$predsubscribers_for_last_30_days_isBAD, shiny1[,outcome], colour_id=7, overlaid=T)
if ("numeric date" %in% input$selected_columns)
plot_roc(shiny1$prednumeric_date, shiny1[,outcome], colour_id=8, overlaid=T)
if ("category" %in% input$selected_columns)
plot_roc(shiny1$predcategory, shiny1[,outcome], colour_id=13, overlaid=T)
legend("bottomright", legend = input$selected_columns, col = 1:(length(input$selected_columns) + 1), lty = 1)
})
output$cluster <- renderPlot({
k <- input$selectn
groups <- kmeans(scaled_df, k, nstart=100, iter.max=100)$cluster
kmclust.project2D <- cbind(project2D, cluster=as.factor(groups))
kmclust.hull <- find_convex_hull(kmclust.project2D, groups)
ggplot(kmclust.project2D, aes(x=PC1, y=PC2)) + geom_point(aes(shape=cluster, color=cluster)) +
geom_polygon(data=kmclust.hull, aes(group=cluster, fill=cluster), alpha=0.4, linetype=0) +
labs(title = sprintf("Clustering Plot - K = %d", k)) +
scale_x_continuous(name = "Principal Component 1") + scale_y_continuous(name = "Principal Component 2") +
theme(legend.position="bottom", text=element_text(size=15),
plot.title = element_text(hjust = 0.5, size = 19, face = "bold",color="darkred"),
axis.title.x = element_text(size = 13), axis.title.y = element_text(size = 13))
})
output$perfPlot <- renderPlot({
data_to_plot <- all_perf_long[all_perf_long$metric == input$indicator,]
ggplot(data_to_plot, aes(x=model_type, y=value, color=model_type)) +
geom_point(size=8, position=position_dodge(0.8),show.legend = FALSE) +
labs(title="Indicator Performance for 3 Classifiers", y="Indicator Value",x="") +
theme_minimal() +
theme(legend.position="bottom",text=element_text(size=15),
plot.title = element_text(hjust = 0.5, size = 19, face = "bold",color="darkred"),
axis.title.y = element_text(size = 16, face = "bold",color="darkgray"), axis.text.x = element_text(size = 15, face = "bold") ,axis.text.y = element_text(size = 15, face = "bold") )
})
}
shinyApp(ui = ui, server = server)
World Bank. (2023). World Bank Data. Retrieved October 15, 2023, from https://data.worldbank.org/indicator/NY.GDP.PCAP.CD
Elgiriyewithana, E. (2023). Global YouTube Statistics 2023. Kaggle. Retrieved October 15, 2023, from https://www.kaggle.com/datasets/nelgiriyewithana/global-youtube-statistics-2023
GPT-3.5. (2023). 2.3.2.2 AUC function for XGBoost introduced by GPT-3.5. From https://chat.openai.com
GPT-3.5. (2023). 2.4.1 Performance Matrix function edited by GPT-3.5. From https://chat.openai.com
GPT-3.5. (2023). 2.4.1 ROC plot improved by GPT-3.5. From https://chat.openai.com